This paper studies several problems concerning channel inclusion, which is a partial ordering
between discrete memoryless channels (DMCs) proposed by Shannon. Specifically, majorization-based
conditions are derived for channel inclusion between certain DMCs. Furthermore, under general
conditions, channel equivalence defined through Shannon ordering is shown to be the same as
permutation of input and output symbols. The determination of channel inclusion is considered as
a convex optimization problem, and the sparsity of the weights related to the representation of
the worse DMC in terms of the better one is revealed when channel inclusion holds between two
DMCs. For the exploitation of this sparsity, an effective iterative algorithm is established
based on modifying the orthogonal matching pursuit algorithm.



I. Introduction

The comparison between different communication channels has been a long-standing problem

since the establishment of Shannon theory. Such comparisons are usually established through 
partial ordering between two channels. Channel inclusion [1] is a partial ordering defined for 
DMCs, when one DMC is obtained through randomization at both the input and the output of 
another, and the latter is said to include the former. Such an ordering between two DMCs implies 
that for any code over the worse (included) DMC, there exists a code of the same rate over 
the better (including) one with a lower error rate. This enables ordering functions such as the
error exponent or channel dispersion. Channel inclusion can be viewed as a generalization of the 
comparisons of statistical experiments established in [2], [3], in the sense that the latter involves 
output randomization (degradation) but not input randomization. There are also other kinds of 
channel ordering. For example, more capable ordering and less noisy ordering [4] enable the 
characterization of capacity regions of broadcast channels. The partial ordering between finite- 
state Markov channels is analyzed in [5], [6]. Our focus in this paper will be exclusively on 
channel inclusion as defined by Shannon [1]. 

It is of interest to know how to determine, either analytically or numerically, whether one DMC
includes another. To the best of our knowledge, regarding the conditions for channel
inclusion, the only results beyond Shannon's paper [1] are provided in [7], [8], and there is not 
yet any discussion on the numerical characterization of channel inclusion in existing literature. 
In this paper, we derive conditions for channel inclusion between DMCs with certain special 
structure, as well as channel equivalence, which complements the results in [7] in useful ways, 
and relate channel inclusion to the well-established majorization theory. In addition, we delineate 
the computational aspects of channel inclusion, by formulating a convex optimization problem 
for determining if one DMC includes another, using a sparse representation. Compared to the 
conference version [9], this paper contains significant extensions. As an example, for the purpose 
of obtaining a sparse solution, we develop an iterative algorithm based on modifying orthogonal 
matching pursuit (OMP) and demonstrate its effectiveness. Moreover, we also find necessary 
and sufficient conditions for channel equivalence. 

The rest of this paper is organized as follows. Section II establishes the notation and describes 
existing literature. Section III derives conditions for channel inclusion between DMCs with 
special structure. Computational issues regarding channel inclusion are addressed in Section
IV, followed by Section V, which develops a sparsity-inducing algorithm for establishing channel
inclusion. Section VI concludes the paper.

II. Notations and Preliminaries 

Throughout this paper, a DMC is represented by a row-stochastic matrix, i.e. a matrix with
all entries being non-negative and each row summing up to 1. All the vectors involved are row
vectors unless otherwise specified. The entry of matrix $K$ with index $(i,j)$ and the entry of
vector $a$ with index $j$ are denoted by $[K]_{(i,j)}$ and $a_{(j)}$ respectively. The maximum
(minimum) entry of vector $a$ is denoted by $\max\{a\}$ ($\min\{a\}$), and $a \ge 0$ specifies
entry-wise non-negativity. The $i$-th row and $j$-th column of $K$ are denoted by $[K]_{(i,:)}$
and $[K]_{(:,j)}$ respectively. The set of indices from $n_1$ to $n_2 > n_1$ is denoted by
$n_1 : n_2$. The $n \times m$ matrix with all entries being 0 (or 1) is denoted by
$0_{n \times m}$ ($1_{n \times m}$). Also for convenience, we identify a DMC and its stochastic
matrix, and apply the terms "square", "doubly stochastic" and "circulant" for matrices directly
to DMCs. We next reiterate some of the definitions and results in the literature related to this
paper.

Definition 1: A DMC described by an $n_1 \times m_1$ matrix $K_1$ is said to include [1] another
$n_2 \times m_2$ DMC $K_2$, denoted by $K_1 \supseteq K_2$ or $K_2 \subseteq K_1$, if there exist
a probability vector $g \in \mathbb{R}_+^{\beta}$ and $\beta$ pairs of stochastic matrices
$\{R_\alpha, T_\alpha\}_{\alpha=1}^{\beta}$ such that

$$ \sum_{\alpha=1}^{\beta} g_{(\alpha)} R_\alpha K_1 T_\alpha = K_2. \qquad (1) $$

$K_1$ and $K_2$ are said to be equivalent if $K_1 \supseteq K_2$ and $K_2 \supseteq K_1$. We say
$K_2$ is strictly included in $K_1$, denoted by $K_2 \subset K_1$, if $K_2 \subseteq K_1$ and
$K_1 \not\subseteq K_2$. Intuitively, $K_2$ can be thought of as an input/output processed
version of $K_1$, with $g_{(\alpha)}$ being the probability that $K_1$ is processed by
$R_\alpha$ (input) and $T_\alpha$ (output). An operational interpretation of this definition is
given in Figure 1, where, to "simulate" $K_2$, the channel $R_\alpha K_1 T_\alpha$ is used with
probability $g_{(\alpha)}$.

Fig. 1. Operational interpretation of $K_2 \subseteq K_1$, with $K_1$ of size $n_1 \times m_1$ and $K_2$ of size $n_2 \times m_2$
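As a small illustration of (1), the following sketch mixes two processed copies of a channel. It is a minimal example under stated assumptions: the helper names `matmul` and `mix`, the choice of a binary symmetric $K_1$ with crossover probability 1/10, and the weights 3/4 and 1/4 are all hypothetical and not taken from the paper.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mix(g, Rs, K1, Ts):
    """Return sum_a g[a] * R_a K1 T_a, i.e. the left-hand side of (1)."""
    terms = [matmul(matmul(R, K1), T) for R, T in zip(Rs, Ts)]
    n2, m2 = len(terms[0]), len(terms[0][0])
    return [[sum(w * M[i][j] for w, M in zip(g, terms)) for j in range(m2)]
            for i in range(n2)]

# Hypothetical example: K1 is a binary symmetric channel with crossover 1/10.
K1 = [[F(9, 10), F(1, 10)], [F(1, 10), F(9, 10)]]
identity = [[1, 0], [0, 1]]   # pure channel that leaves symbols untouched
flip = [[0, 1], [1, 0]]       # pure channel that swaps the two symbols

# With probability 3/4 use K1 unchanged; with probability 1/4 flip its output.
g = [F(3, 4), F(1, 4)]
K2 = mix(g, [identity, identity], K1, [identity, flip])
```

Under these assumptions `K2` is again a binary symmetric channel, now with crossover probability 3/10, so it is included in `K1` by construction.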

Definition 2: A DMC $K_2$ is said to be an (output) degraded version [2], [3] of another DMC
$K_1$, if there exists a stochastic matrix $T$ such that $K_1 T = K_2$.

Note that output degradation in Definition 2 is stronger than inclusion in Definition 1. There are
several analytical conditions for channel inclusion derived in [7] for a special case of Definition
1 with $\beta = 1$. Reference [7] considers two kinds of DMCs, given by a $2 \times 2$ full-rank
stochastic matrix $P$, and an $n \times n$ stochastic matrix with identical diagonal entries $p$
and identical off-diagonal entries $(1-p)/(n-1)$, respectively. Necessary and sufficient
conditions for $K_2 = R K_1 T$, where $R$ and $T$ are stochastic matrices, are derived for the
cases in which $K_1$ and $K_2$ are of either of the two kinds. Note that this assumes
$\beta = 1$ in (1) and is with loss of generality. Conditions of inclusion for the general
$\beta > 1$ case have not yet been considered in the literature.

Channel inclusion can be equivalently defined with the $R_\alpha$'s and $T_\alpha$'s in
Definition 1 being stochastic matrices in which all the entries are 0 or 1, as stated in [1],
where $R_\alpha$'s and $T_\alpha$'s of this kind are called pure matrices (or pure channels).
This is easily corroborated based on the fact that every stochastic matrix can be represented as
a convex combination of such pure matrices. This is due to the fact that the set of stochastic
matrices is convex and that (0, 1) stochastic matrices are extremal points of this set
[10, Theorem 1]. When $R_\alpha$ and $T_\alpha$ are pure matrices, the product
$R_\alpha K_1 T_\alpha$ can be interpreted as a DMC whose input labels and output labels
have been either permuted or combined. Therefore channel inclusion implies that the included
DMC $K_2$ is in the convex hull of all such matrices, as seen in (1).
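The convex-combination fact used above can be made concrete with a simple greedy peeling procedure. This is only an illustrative sketch, not the argument of [10]: the function name `decompose` and the example matrix are hypothetical. Each step selects one positive entry per row, which defines a pure matrix, and subtracts the largest weight that keeps all entries non-negative.

```python
from fractions import Fraction as F

def decompose(R):
    """Write a row-stochastic R as sum_k w_k * P_k with pure (0/1) matrices P_k.

    Each P_k is encoded by `choice`, where choice[i] is the column holding the
    single 1 in row i. Returns a list of (weight, choice) pairs.
    """
    R = [row[:] for row in R]          # work on a copy
    parts = []
    while any(x > 0 for x in R[0]):    # all rows carry the same remaining mass
        # In every row pick the column with the largest remaining entry.
        choice = [max(range(len(row)), key=row.__getitem__) for row in R]
        w = min(row[c] for row, c in zip(R, choice))
        parts.append((w, choice))
        for row, c in zip(R, choice):
            row[c] -= w                # at least one entry drops to zero
    return parts

R = [[F(1, 2), F(1, 2)], [F(1, 4), F(3, 4)]]
parts = decompose(R)
```

Exact rational arithmetic is used so that the loop terminates cleanly; with floating-point entries a small tolerance would be needed in the loop condition.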

By considering $N$ uses of a DMC $K$, we equivalently have the DMC $K^{\otimes N}$, which is the
$N$-fold Kronecker product of $K$. We have the following theorem, which was mentioned in [1]
without a detailed proof.

Theorem 1: $K_2 \subseteq K_1$ implies $K_2^{\otimes N} \subseteq K_1^{\otimes N}$.
Proof: See Appendix A. ■

As shown in [1], $K_2 \subseteq K_1$ has the implication that if there is a set of $M$ code words
$\{w_i\}_{i=1}^{M}$ of length $N$, such that an error rate of $P_e$ is achieved with the code
words being used with probabilities $\{p_i\}_{i=1}^{M}$ under $K_2$, then there exists a set of
$M$ code words of length $N$, such that an error rate no greater than $P_e$ is achieved under
$K_1$ with the code words being used with probabilities $\{p_i\}_{i=1}^{M}$. In [11, p. 116],
this implication is stated as one DMC being better in the Shannon sense than another (different
from channel inclusion ordering itself), and it is pointed out that $K_1 \supseteq K_2$ is a
sufficient but not necessary condition for $K_1$ to be better in the Shannon sense than $K_2$,
with the proof provided in [12]. This ordering of error rate in turn implies that the capacity of
$K_1$ is no less than the capacity of $K_2$, and the same ordering holds for their error exponents.

Channel inclusion, as defined, is a partial order between two DMCs: it is possible to have
two DMCs $K_1$ and $K_2$ such that $K_1 \not\supseteq K_2$ and $K_2 \not\supseteq K_1$. For the
purpose of making it possible to compare an arbitrary pair of DMCs, a metric based on the total
variation distance, namely Shannon deficiency, is introduced in [8]. In our notation, the Shannon
deficiency of $K_1$ with respect to $K_2$ is defined as




where $g \in \mathbb{R}_+^{\beta}$ is a probability vector, $R_\alpha$'s and $T_\alpha$'s are
stochastic matrices, $\|A\|_\infty = \max_i \|[A]_{(i,:)}\|_1 = \|A^T\|_1$ is the $\infty$-norm of
matrix $A$, and we impose matrix transpose since we treat channel matrices as row-stochastic
instead of column-stochastic. Intuitively, the above Shannon deficiency quantifies how far $K_1$
is from including $K_2$. Other useful deficiency-like quantities are established
in [8] by substituting the total variation distance with divergence-based metrics obeying a data 
processing inequality between probability distributions. 

III. Analytical Conditions for Channel Equivalence and Inclusion 

In general, given two DMCs $K_1$ and $K_2$, there is no straightforward method to check if
one includes the other based on their entries. Nevertheless, it is possible to characterize the
conditions for channel inclusion for the cases in which both $K_1$ and $K_2$ have structure. In
this section, we derive conditions for the cases of doubly stochastic and circulant DMCs. For the
case of equivalence between two DMCs, we establish a necessary and sufficient condition which
is effectively applicable to any DMCs. We first define some useful notions.

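Definition 3: For two vectors $a, b \in \mathbb{R}^n$, $a$ is said to majorize (or dominate) $b$,
written $a \succ b$, if and only if $\sum_{i=1}^{k} a^{\downarrow}_{(i)} \ge \sum_{i=1}^{k} b^{\downarrow}_{(i)}$
for $k = 1, \ldots, n-1$ and $\sum_{i=1}^{n} a_{(i)} = \sum_{i=1}^{n} b_{(i)}$, where
$a^{\downarrow}_{(i)}$ and $b^{\downarrow}_{(i)}$ are entries of $a$ and $b$ sorted in decreasing order.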
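Definition 3 translates directly into a short check. The sketch below is a minimal floating-point implementation; the function name `majorizes` and the tolerance `tol` are assumptions made for illustration.

```python
def majorizes(a, b, tol=1e-12):
    """Return True if vector a majorizes vector b in the sense of Definition 3."""
    if len(a) != len(b):
        return False
    sa, sb = sorted(a, reverse=True), sorted(b, reverse=True)
    if abs(sum(sa) - sum(sb)) > tol:      # total sums must agree
        return False
    ca = cb = 0.0
    for x, y in zip(sa[:-1], sb[:-1]):    # partial sums for k = 1, ..., n-1
        ca += x
        cb += y
        if ca < cb - tol:
            return False
    return True
```

For example, a point mass majorizes every probability vector of the same length, and every probability vector majorizes the uniform one.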

Definition 4: A circulant matrix is a square matrix in which the $i$-th row is generated from
cyclic shift of the first row by $i - 1$ positions to the right.

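Definition 5: An $n \times n$ matrix $P$ is said to be doubly stochastic if the following
conditions are satisfied: (i) $[P]_{(i,j)} \ge 0$ for $i, j = 1, \ldots, n$;
(ii) $\sum_{i} [P]_{(i,j)} = 1$ for $j = 1, \ldots, n$; (iii) $\sum_{j} [P]_{(i,j)} = 1$ for
$i = 1, \ldots, n$.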

Definition 6: A DMC is called symmetric if its rows are permutations of each other, and its 
columns are permutations of each other [13, p. 190]. 

It is easy to verify that if a symmetric DMC is square, then it must be doubly stochastic. In 
the next section, we will focus mostly on square DMCs (i.e. DMCs with equal size input and 
output alphabets), and we assume this condition unless otherwise specified. 
A. Equivalence Condition between DMCs

We address the general condition for two DMCs to be equivalent, which has not been 
considered in the literature. By imposing some mild assumptions, we have the following theorem 
which gives the equivalence condition between two DMCs. 

Theorem 2: Let two DMCs $K_1$ and $K_2$ satisfy the following three assumptions:

• AS1 Capacity-achieving input distribution(s) contain no zero entry; that is, the capacity is
not achieved if some of the input symbols are not used;

• AS2 There is no all-zero column, and no column being a multiple of another;

• AS3 If $K_1 = P_1 K_1 P_2 D$ with permutation matrices $P_1$, $P_2$ and diagonal matrix $D$,
then it is required that $P_1$, $P_2$ and $D$ are identity matrices; that is, by permuting the
rows and columns of $K_1$, it is not possible to obtain a DMC whose columns are proportional to
those of $K_1$. This property also applies to $K_2$.

Then a necessary and sufficient condition for $K_1$ being equivalent to $K_2$ is that
$K_2 = R K_1 T$ with $R$ and $T$ being permutation matrices (thereby requiring $K_1$ and $K_2$
to be of the same size $n \times m$).

Proof: See Appendix B. ■ 
We have the following remarks about Theorem 2. AS1 is verifiable through the Blahut-Arimoto
algorithm [13, ch. 13]. Specifically, capacities can be obtained for the $n \times m$ DMC $K$
itself and the ones obtained by removing the $k$-th row from $K$ for $k = 1, \ldots, n$, and if
the capacity is always reduced by removing a row, then the capacity-achieving input distribution
of $K$ should have no zero entry. AS2 can be verified simply by inspection. Also, since DMCs are
usually of small sizes in practice, it is viable to verify AS3 by inspection. For example, no
column being a multiple of some entry-permuted version of another column makes a sufficient
condition for AS3 to hold.

If two DMCs satisfying the above three assumptions are equivalent, there is an eigenvalue-based
approach for finding the permutation matrices without searching over all $n!\,m!$ such
permutations. Starting from $K_2 = R K_1 T$, and $R^T = R^{-1}$, $T^T = T^{-1}$, which is a
property of permutation matrices, we have $K_2 K_2^T = R K_1 K_1^T R^{-1}$, which leads to the
determination of $R$. In order to do this, the first step is to perform the eigenvalue
decomposition: $K_1 K_1^T = Q_1 \Lambda Q_1^{-1}$ and $K_2 K_2^T = Q_2 \Lambda Q_2^{-1}$, where
$\Lambda$ is a diagonal matrix, and $Q_1$ and $Q_2$ are both unitary matrices. Notice that it is
necessary for $K_1 K_1^T$ and $K_2 K_2^T$ to have the same set of eigenvalues, otherwise $K_1$
and $K_2$ cannot be equivalent. Once we have these decompositions, we can immediately obtain
$R = Q_2 Q_1^{-1}$, which is required to be a permutation matrix for $K_1$ and $K_2$ to be
equivalent. The determination of $T$ can also be made using the same approach based on
$K_2^T K_2 = T^{-1} K_1^T K_1 T$, i.e. following from the eigenvalue decompositions
$K_1^T K_1 = Q_3 \Gamma Q_3^{-1}$ and $K_2^T K_2 = Q_4 \Gamma Q_4^{-1}$,
$T = Q_3 Q_4^{-1}$ can be obtained.



B. Inclusion Conditions for Doubly Stochastic and Circulant DMCs 

Considering that doubly stochastic matrices have significant theoretical importance, and dou- 
bly stochastic DMCs can be thought of as a generalization of square symmetric DMCs, we first 
introduce the following theorem.

Theorem 3: Let $K_1$ and $K_2$ be $n \times n$ doubly stochastic DMCs, with $w_1$ and $w_2$ being
the $n^2 \times 1$ vectors containing all the entries of $K_1$ and $K_2$ respectively. Then
$w_2 \prec w_1$ is a necessary condition for $K_2 \subseteq K_1$.

Proof: See Appendix C. ■ 

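It should be pointed out that the above-mentioned condition is not sufficient. To see this,
consider

$$
K_1 = \frac{1}{15}\begin{bmatrix} 1 & 2 & 3 & 4 & 5 \\ 5 & 1 & 2 & 3 & 4 \\ 4 & 5 & 1 & 2 & 3 \\ 3 & 4 & 5 & 1 & 2 \\ 2 & 3 & 4 & 5 & 1 \end{bmatrix}, \quad
K_2 = \frac{1}{15}\begin{bmatrix} 1 & 2 & 3 & 4 & 5 \\ 5 & 1 & 2 & 3 & 4 \\ 3 & 4 & 1 & 5 & 2 \\ 2 & 5 & 4 & 1 & 3 \\ 4 & 3 & 5 & 2 & 1 \end{bmatrix}. \qquad (3)
$$

Both matrices contain the same entries, so $w_2 \prec w_1$ and $w_1 \prec w_2$ hold
simultaneously; if the condition were sufficient, it would be implied that $K_1$ and $K_2$ are
equivalent. However, based on Theorem 2, it can be verified that $K_1$ and $K_2$ are not
equivalent, since there do not exist permutation matrices $R$ and $T$ such that $K_2 = R K_1 T$
due to the different sets of singular values of $K_1$ and $K_2$, thereby implying that
$w_2 \prec w_1$ is not sufficient for $K_2 \subseteq K_1$.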
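The two properties of example (3) used in this argument can be checked with exact integer arithmetic. The sketch below relies on a standard identity rather than on anything specific to the paper: $\mathrm{trace}((K K^T)^2)$ equals the sum of the fourth powers of the singular values of $K$, so two matrices with different traces cannot share the same singular values. The function name `trace_gram_squared` is hypothetical, and the common factor 1/15 is dropped because it scales both matrices identically.

```python
def trace_gram_squared(K):
    """Return trace((K K^T)^2), the sum of fourth powers of K's singular values."""
    gram = [[sum(a * b for a, b in zip(r, s)) for s in K] for r in K]
    # For a symmetric matrix G, trace(G^2) is the sum of its squared entries.
    return sum(x * x for row in gram for x in row)

# Integer versions of K1 and K2 in (3), i.e. both multiplied by 15.
K1 = [[1, 2, 3, 4, 5], [5, 1, 2, 3, 4], [4, 5, 1, 2, 3], [3, 4, 5, 1, 2], [2, 3, 4, 5, 1]]
K2 = [[1, 2, 3, 4, 5], [5, 1, 2, 3, 4], [3, 4, 1, 5, 2], [2, 5, 4, 1, 3], [4, 3, 5, 2, 1]]

# Same entries overall, so the entry vectors majorize each other.
same_entries = sorted(sum(K1, [])) == sorted(sum(K2, []))
# Different traces certify different singular values, hence no K2 = R K1 T.
different_spectrum = trace_gram_squared(K1) != trace_gram_squared(K2)
```

This check only certifies that the singular values differ; it does not by itself recover them.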

Consider the case of both $K_1$ and $K_2$ being $n \times n$ circulant. Circulant DMCs are used
to model channel noise captured by modulo arithmetic and have applications in discrete degraded
interference channels [14]. We have the following result:

Theorem 4: Let $K_1$ and $K_2$ be $n \times n$ circulant DMCs, with vectors $v_1$ and $v_2$ being
their first rows, respectively. Then for $K_2 \subseteq K_1$, a necessary condition is
$v_2 \prec v_1$. A sufficient condition is that $v_2$ can be represented as the circular
convolution of $v_1$ and another probability vector $x$ such that $v_1 \circledast x = v_2$,
which is also sufficient for output degradation.

Proof: See Appendix D. ■ 
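The sufficient condition in Theorem 4 can be illustrated numerically. The following sketch assumes hypothetical vectors `v1` and `x` and hypothetical helper names; it forms $v_2 = v_1 \circledast x$, confirms that the circulant built from $x$ acts as a degrading matrix $T$ with $K_1 T = K_2$, and checks the necessary condition $v_2 \prec v_1$ with exact arithmetic.

```python
from fractions import Fraction as F

def circulant(v):
    """Circulant matrix whose i-th row is v cyclically shifted right by i positions."""
    n = len(v)
    return [[v[(j - i) % n] for j in range(n)] for i in range(n)]

def circ_conv(v, x):
    """Circular convolution of two length-n vectors."""
    n = len(v)
    return [sum(v[k] * x[(j - k) % n] for k in range(n)) for j in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def majorizes(a, b):
    """Exact version of Definition 3 for rational entries."""
    sa, sb = sorted(a, reverse=True), sorted(b, reverse=True)
    return sum(sa) == sum(sb) and all(
        sum(sa[:k]) >= sum(sb[:k]) for k in range(1, len(sa)))

# Hypothetical first row of K1 and hypothetical probability vector x.
v1 = [F(7, 10), F(2, 10), F(1, 10)]
x = [F(1, 2), F(1, 2), F(0)]
v2 = circ_conv(v1, x)

K1, T, K2 = circulant(v1), circulant(x), circulant(v2)
degraded = matmul(K1, T) == K2          # K2 is an output-degraded version of K1
necessary_holds = majorizes(v1, v2)     # v2 is majorized by v1
```

Here `v2` is a convex combination of cyclic shifts of `v1`, which is why the majorization check succeeds whenever the convolution representation exists.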

It is clear that a $2 \times 2$ doubly stochastic DMC (also known as the binary symmetric
channel) is circulant and characterized solely by the crossover probability; thus the condition
for inclusion between two such $2 \times 2$ DMCs boils down to the comparison between their
crossover probabilities. Furthermore, for $n = 3, 4$, it is easy to verify that if an
$n \times n$ symmetric DMC is not circulant, there is a circulant DMC equivalent to it (for
$n \ge 5$ there is no such guarantee, as seen in (3)); therefore we can conclude the following.

Corollary 1: For $n = 3, 4$, let $K_1$ and $K_2$ be $n \times n$ symmetric DMCs, which are
equivalent to circulant DMCs $K_1'$ and $K_2'$ respectively. Let $v_1$ and $v_2$ be the first
rows of $K_1'$ and $K_2'$ respectively. Then for $K_2 \subseteq K_1$, a necessary condition is
that $v_2 \prec v_1$, while a sufficient condition is that $v_2$ can be represented as the
circular convolution of $v_1$ and another probability vector in $\mathbb{R}_+^{n}$.

Proof: See Appendix E. ■ 
We finally make a few remarks about inclusion between the binary symmetric channel (BSC)
with crossover probability $p < 1/2$ and the binary erasure channel (BEC) with erasure
probability $\epsilon$. It is well-known that BSC($p$) is a degraded version of BEC($\epsilon$)
if and only if $0 \le \epsilon \le 2p$ [15, ch. 5.6]. It can further be shown that
BSC($p$) $\subseteq$ BEC($\epsilon$) if and only if $0 \le \epsilon \le 2p$, while
BEC($\epsilon$) $\subseteq$ BSC($p$) if and only if $p = 0$. The "if" part follows directly from
the fact that degradation implies channel inclusion. The "only if" part can be justified by the
fact that inclusion is absent between BEC($\epsilon$) and BSC($p$) if $\epsilon > 2p$ or $p > 0$.
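The "if" direction for degradation can be made explicit with one particular symmetric post-processing matrix. The sketch below is an illustration under assumptions: the function names, the output ordering (0, erasure, 1) and the restriction $\epsilon < 1$ are choices made here, and returning `None` only means that this specific construction is unavailable (the general impossibility for $\epsilon > 2p$ is the cited result, not something this code proves).

```python
from fractions import Fraction as F

def bec(e):
    """BEC(e) with outputs ordered as (0, erasure, 1)."""
    return [[1 - e, e, 0], [0, e, 1 - e]]

def bsc(p):
    return [[1 - p, p], [p, 1 - p]]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def degrading_map(e, p):
    """Return a 3x2 stochastic T with BEC(e) T = BSC(p), or None if a < 0 or a > 1.

    Unerased symbols are flipped with probability a and erasures are resolved by
    a fair coin, so the resulting crossover is (1 - e) * a + e / 2. Assumes e < 1.
    """
    a = (p - e / 2) / (1 - e)
    if a < 0 or a > 1:
        return None
    return [[1 - a, a], [F(1, 2), F(1, 2)], [a, 1 - a]]

T = degrading_map(F(1, 5), F(1, 4))           # here e = 1/5 <= 2p = 1/2
simulated = matmul(bec(F(1, 5)), T)
too_noisy = degrading_map(F(3, 5), F(1, 4))   # here e = 3/5 > 2p, construction fails
```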

IV. Computational Aspects of Channel Inclusion 

In Section III, we have established analytical conditions for determining if a DMC with
structure includes another. It is also of interest to know how this can be determined numerically
when there is no structure. Furthermore, once it has been determined that $K_2 \subseteq K_1$,
it is desirable for the $g_{(\alpha)}$ probabilities in (1) to contain as many zeros as possible
to get a concise representation.

In this section, we provide a linear programming approach to calculating Shannon deficiency,
which also enables checking if inclusion holds. For the cases in which channel inclusion is
known to hold, we prove that sparse solutions exist and discuss how this sparse solution for $g$
can be obtained through sparse recovery techniques, such as orthogonal matching pursuit.

We first take a look at determining if $K_2 \subseteq K_1$ through convex optimization. For
$K_1$ of size $n_1 \times m_1$ and $K_2$ of size $n_2 \times m_2$, the problem can be formulated as

$$
\begin{aligned}
\text{minimize} \quad & \left\| \sum_{\alpha=1}^{\beta} g_{(\alpha)} R_\alpha K_1 T_\alpha - K_2 \right\| \\
\text{subject to} \quad & \sum_{\alpha=1}^{\beta} g_{(\alpha)} = 1, \quad g_{(\alpha)} \ge 0
\end{aligned}
\qquad (4)
$$

with variables $g \in \mathbb{R}^{\beta}$, where $R_\alpha$ is an $n_2 \times n_1$ and $T_\alpha$
is an $m_1 \times m_2$ stochastic matrix for $\alpha = 1, \ldots, \beta$, and
$K_2 \subseteq K_1$ is determined if the optimal value is zero. As mentioned in Section II, the
$R_\alpha$'s and $T_\alpha$'s can be equivalently treated as pure channels, so there are at most
$n_1^{n_2} m_2^{m_1}$ different $\{R_\alpha, T_\alpha\}$ pairs, and consequently there are
finitely many $g_{(\alpha)}$'s involved in the problem (4). It is easy to see that (4) is a
convex optimization problem, and it can be re-formulated as a linear programming problem with
variables $g_{(\alpha)}$ and an $n_2 \times 1$ vector $c$:



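$$
\begin{aligned}
\text{minimize} \quad & 1^T c \\
\text{subject to} \quad & -c \le \left[ \sum_{\alpha=1}^{\beta} g_{(\alpha)} R_\alpha K_1 T_\alpha - K_2 \right]_{(:,j)} \le c, \quad \text{for } j = 1, \ldots, m_2, \\
& \sum_{\alpha=1}^{\beta} g_{(\alpha)} = 1, \quad g_{(\alpha)} \ge 0.
\end{aligned}
\qquad (5)
$$

We also notice that the optimal value of (5) provides a way to evaluate the Shannon deficiency
of $K_1$ with respect to $K_2$.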

In the above analysis, the maximum number of $\{R_\alpha, T_\alpha\}$ pairs, given by
$n_1^{n_2} m_2^{m_1}$ (or $(n!)^2$ if both $K_1$ and $K_2$ are $n \times n$ doubly stochastic),
grows very rapidly with the sizes of $K_1$ and $K_2$. With $K_2 \subseteq K_1$ already
determined, it is natural to ask if (1) can hold with some reduced number of
$\{R_\alpha, T_\alpha\}$ pairs. In other words, we seek to have a sparse solution of $g$. We have
the following theorem regarding the sparsity of $g$ given $K_2 \subseteq K_1$, based on
Caratheodory's theorem [16, p. 155].

Theorem 5: For two DMCs $K_1$ of size $n_1 \times m_1$ and $K_2$ of size $n_2 \times m_2$, if
$K_2 \subseteq K_1$, there exist a probability vector $g \in \mathbb{R}_+^{\beta}$ and $\beta$
pairs of stochastic matrices $\{R_\alpha, T_\alpha\}_{\alpha=1}^{\beta}$ such that (1) holds with
$\beta \le n_2(m_2 - 1) + 1$. If both $K_1$ and $K_2$ are $n \times n$ doubly stochastic, the
number of necessary $\{R_\alpha, T_\alpha\}$ pairs in (1) can be improved as
$\beta \le (n - 1)^2 + 1$.
Proof: See Appendix F. ■
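To give a feel for the gap between the number of candidate pairs and the bounds in Theorem 5, the counts can be tabulated for small sizes. The function names below are hypothetical; the formulas are the ones stated in the text.

```python
from math import factorial

def num_pure_pairs(n1, m1, n2, m2):
    """Number of pure {R, T} pairs: R is n2 x n1 and T is m1 x m2 with 0/1 entries."""
    return n1 ** n2 * m2 ** m1

def general_bound(n2, m2):
    """Upper bound on beta from Theorem 5 for general DMCs."""
    return n2 * (m2 - 1) + 1

def doubly_stochastic_pairs(n):
    """Number of permutation pairs when both DMCs are n x n doubly stochastic."""
    return factorial(n) ** 2

def doubly_stochastic_bound(n):
    """Improved upper bound on beta from Theorem 5 in the doubly stochastic case."""
    return (n - 1) ** 2 + 1

# Illustrative sizes: both channels 3 x 3.
pairs_3 = num_pure_pairs(3, 3, 3, 3)
bound_3 = general_bound(3, 3)
```

Even at this size the number of candidate pairs exceeds the bound by two orders of magnitude, which is what motivates looking for a sparse $g$.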
It is well-known that a typical approach to recover a sparse signal vector from its linear
measurements is compressed sensing with $\ell_1$ norm minimization (also known as basis pursuit).
To apply this approach to our problem, we can formulate it as

$$
\begin{aligned}
\text{minimize} \quad & \sum_{\alpha=1}^{\beta} |g_{(\alpha)}| \\
\text{subject to} \quad & \sum_{\alpha=1}^{\beta} g_{(\alpha)} R_\alpha K_1 T_\alpha = K_2
\end{aligned}
\qquad (6)
$$

with variables $g \in \mathbb{R}^{\beta}$. It is easy to prove that the optimal $g$ always comes
out non-negative given $K_2 \subseteq K_1$. However, (6) does not necessarily give a sparse
solution for $g$. As pointed out in [17], which addresses the solvability of a sparse probability
vector based on linear measurements through $\ell_1$ norm minimization, in order for the sparse
probability vector to be solvable, the number of independent measurements needs to be at least
two times the sparsity level. In our case this is not satisfied, since the number of independent
equations ($n_2(m_2 - 1)$ or $(n - 1)^2$) in the constraints in (6) is usually less than
$2\beta$ (which can be up to $2n_2(m_2 - 1) + 2$ or $2(n - 1)^2 + 2$).
There are also other sparsity-inducing numerical methods such as matching pursuit, which will
be addressed in the next section.

V. Channel Inclusion through OMP 

Orthogonal matching pursuit (OMP) [18] and its variants are widely investigated in the
literature for sparse solutions of linear equations. The OMP algorithm gives a possibly
sub-optimal solution to the following problem with vector $g$ being the variable

$$
\begin{aligned}
\text{minimize} \quad & Q(g) = \|h - Ag\|_2^2 \\
\text{subject to} \quad & \|g\|_0 \le s
\end{aligned}
\qquad (7)
$$

through which the known upper bound $s$ of sparsity level is exploited. Notice that the standard
OMP algorithm does not impose the constraint $g \ge 0$. In the context of the channel inclusion
problem, $A$ is an $n_2 m_2 \times n_1^{n_2} m_2^{m_1}$ matrix with its $\alpha$-th column
$[A]_{(:,\alpha)} = \mathrm{vec}(R_\alpha K_1 T_\alpha)$ (i.e. $[A]_{(:,\alpha)}$ is the
vectorized version of $R_\alpha K_1 T_\alpha$ obtained by stacking its columns in a vector), and
$h = \mathrm{vec}(K_2)$. Moreover, we have the additional constraint $g \ge 0$ so that (7) becomes

$$
\begin{aligned}
\text{minimize} \quad & Q(g) = \|h - Ag\|_2^2 \\
\text{subject to} \quad & \|g\|_0 \le s, \quad g \ge 0
\end{aligned}
\qquad (8)
$$

where $s = n_2(m_2 - 1) + 1$. Note that if inclusion is present the solution will automatically
satisfy $\|g\|_1 = 1$, without adding this as an extra constraint. The problem in (8) is related
to (4) and (5) in the sense that if the optimal value of (8) is zero, the solution of (8) is also
the solution of (4) and (5).
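The construction of $A$ and $h$ described above can be sketched for very small channels by enumerating all pure pairs. This is an illustrative sketch only: the function names are hypothetical, the dictionary is returned as a list of columns rather than a matrix, and full enumeration is practical only for tiny alphabets because the number of columns grows as $n_1^{n_2} m_2^{m_1}$.

```python
from fractions import Fraction as F
from itertools import product

def pure_channels(rows, cols):
    """Yield every rows x cols stochastic matrix whose entries are all 0 or 1."""
    for choice in product(range(cols), repeat=rows):
        yield [[1 if j == c else 0 for j in range(cols)] for c in choice]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def vec(M):
    """Stack the columns of M into one vector."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]

def build_dictionary(K1, n2, m2):
    """Return the columns vec(R K1 T) over all pure R (n2 x n1) and T (m1 x m2)."""
    n1, m1 = len(K1), len(K1[0])
    return [vec(matmul(matmul(R, K1), T))
            for R in pure_channels(n2, n1)
            for T in pure_channels(m1, m2)]

# Hypothetical 2 x 2 example: two binary symmetric channels.
K1 = [[F(9, 10), F(1, 10)], [F(1, 10), F(9, 10)]]
K2 = [[F(7, 10), F(3, 10)], [F(3, 10), F(7, 10)]]
A_cols = build_dictionary(K1, 2, 2)   # columns of A
h = vec(K2)                           # right-hand side
```

With `A_cols` and `h` in hand, checking inclusion amounts to asking whether `h` is a convex combination of the columns.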

Briefly, the OMP algorithm finds a sparse solution of (7) by selecting columns of
$A$ having inner products with the residue $h - Ag$ with a large magnitude. This requires taking
the absolute value of the inner products in solving (7), followed by solving a least-square (LS)
problem. However, to solve (8) we require g to have non-negative entries. We will modify the 
standard OMP to encourage this result by not taking the absolute value of the inner products, 
which is shown in Theorem 6 to be a necessary condition for the LS solution to be non-negative 
in each entry. 

In this section, assuming channel inclusion is present, we introduce OMP-like algorithms 
which solve for a sparse probability vector involved in channel inclusion. The established 
algorithm is also applicable to other problems (e.g. solving for sparse probability vector based 
on moments of the discrete random variable [17]) with the objective of solving for non-negative 
vectors, and we will describe it in general terms. Unlike the standard OMP algorithm which 
operates without positivity constraints on the solution, the algorithms established here aim to 
find a non-negative sparse solution of g based on h and A. For this purpose, modifications 
are needed in our algorithms compared to the standard OMP algorithm which solves (7), in 
order to solve the problem in (8). For example, standard OMP relies on choosing the inner 
product with the largest absolute value, while our algorithms consider the signed inner product; 
standard OMP makes one attempt per iteration for the least-square solution, while it is possible 
for our algorithm to make multiple attempts. This is because we insist that at each iteration 
the LS solution yields non-negative entries, which depends on the column chosen at the current 
iteration. If the LS solution provides some negative entries, instead of projecting the solution to 
the non-negative orthant, we start over and select a new column with a positive inner product. 
This preserves the orthogonality of the residue with all the selected columns. The details of our
algorithm are given as follows.



Algorithm 1: The modified OMP algorithm for retrieving a non-negative sparse vector $g$ from
$Ag = h$ with known upper bound of sparsity level $s$ consists of the following
Inputs:

• A $p \times q$ matrix $A$ with $p \ll q$

• A $p \times 1$ vector $h$ which consists of noise-free linear measurements of $g$

• The known upper bound of sparsity level $s$ of the non-negative vector $g$ (in general it is
$p$; for the channel inclusion problem, it is as specified in Theorem 5)

• Tolerance $\epsilon$, for error being essentially zero
Outputs:

• A flag $f$ for a solution being found ($f = 1$) or not found ($f = 0$)

• The number $s_1$ of iterations for the residue to become essentially zero (if $f = 1$)

• A set (vector) $\Lambda_{s_1}$ of column indices for $A$, $\Lambda_{s_1} \subseteq \{1, 2, \ldots, q\}$ (if $f = 1$)

• An $s_1 \times 1$ vector $g_{s_1}$ (if $f = 1$)
Procedure:

Initialize the residue to $r_0 = h$, the set of indices to $\Lambda = 0_{1 \times s}$, the matrix
containing the columns of $A$ which are selected to $A_{\mathrm{sel}} = 0_{p \times s}$, the
inner product vector to $P = 0_{1 \times q}$, and the iteration counter $t = 1$. The remaining
steps are given in pseudo-code as follows:



01: while $t \le s$ and $\|r_{t-1}\|_\infty > \epsilon$
02:     $P = r_{t-1}^T A$;   ▷ inner product generation
03:     $g_t = -1_{t \times 1}$;   ▷ initializing the sparse vector
04:     while $\min\{g_t\} < 0$ and $\max\{P\} > 0$
05:         $\lambda_t = \arg\max_j P_{(j)}$; $\Lambda_{(t)} = \lambda_t$;   ▷ locating the largest inner product
06:         $[A_{\mathrm{sel}}]_{(:,t)} = [A]_{(:,\lambda_t)}$;   ▷ selecting a new column of $A$ corresponding to $\lambda_t$
07:         $g_t = \arg\min_g \|h - [A_{\mathrm{sel}}]_{(:,1:t)} g\|_2^2$;   ▷ solving a least-square problem
08:         $P_{(\lambda_t)} = -1$;   ▷ marking index $\lambda_t$ as attempted to avoid multiple attempts
09:     end;
10:     $r_t = h - [A_{\mathrm{sel}}]_{(:,1:t)} g_t$; $t = t + 1$;   ▷ updating residue for the next iteration
11: end;

Finally, set $f = 1$ if $\min\{g_{t-1}\} \ge 0$ and $\|r_{t-1}\|_\infty \le \epsilon$, otherwise
$f = 0$. With $f = 1$, the other outputs are $s_1 = t - 1$, $\Lambda_{s_1} = \Lambda_{(1:s_1)}$,
and $g_{s_1}$ is as given at the termination of the iterations. The $j$-th entry of $g_{s_1}$ is
the $\lambda_j$-th entry of $g$ and all other entries of $g$ are zero.
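The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the names `lstsq`, `residual` and `nonneg_omp` are hypothetical, the dictionary is passed as a list of columns, the least-square step is solved through the normal equations with Gauss-Jordan elimination (adequate only for small, well-conditioned problems), and positivity of an inner product is tested against the tolerance instead of exactly zero. The toy dictionary at the end is likewise an assumption made for demonstration.

```python
def lstsq(cols, h):
    """Least-squares weights for h ~ sum_k g[k] * cols[k], via the normal equations."""
    t = len(cols)
    M = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(t)]
         + [sum(a * b for a, b in zip(cols[i], h))] for i in range(t)]
    for c in range(t):                       # Gauss-Jordan with partial pivoting
        piv = max(range(c, t), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(t):
            if r != c:
                fac = M[r][c] / M[c][c]
                M[r] = [x - fac * y for x, y in zip(M[r], M[c])]
    return [M[i][t] / M[i][i] for i in range(t)]

def residual(cols, g, h):
    return [hi - sum(gk * col[i] for gk, col in zip(g, cols)) for i, hi in enumerate(h)]

def nonneg_omp(A_cols, h, s, eps=1e-9):
    """Sketch of Algorithm 1; returns (flag f, selected column indices, weights g)."""
    sel, g, r = [], [], list(h)
    while len(sel) < s and max(abs(x) for x in r) > eps:             # Line 01
        P = [sum(a * b for a, b in zip(col, r)) for col in A_cols]   # Line 02
        lam, trial = None, None
        while (trial is None or min(trial) < -eps) and max(P) > eps:  # Line 04
            lam = max(range(len(P)), key=P.__getitem__)              # Line 05
            trial = lstsq([A_cols[k] for k in sel + [lam]], h)       # Lines 06-07
            P[lam] = -1.0                                            # Line 08
        if lam is None:          # no positive inner product is available
            break
        sel.append(lam)
        g = trial
        r = residual([A_cols[k] for k in sel], g, h)                 # Line 10
    ok = bool(g) and min(g) >= -eps and max(abs(x) for x in r) <= eps
    return (1 if ok else 0), sel, g

# Toy dictionary (hypothetical): three unit vectors plus one distractor column.
cols = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
h = [0.2, 0.3, 0.5]
f, sel, g = nonneg_omp(cols, h, s=3)
```

As in the pseudo-code, a column whose least-square solution has a negative entry is discarded in favor of the next largest positive inner product, and the flag reports failure if the final weights are not all non-negative.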
Notice that the "while" loop starting at Line 4 in Algorithm 1 always terminates because there
are always finitely many positive inner products available for selection. Algorithm 1 inherits
the key steps directly from the standard OMP algorithm, as seen from Lines 2, 6, 7 and 10.
It differs from the standard OMP algorithm in that it aims to find a non-negative least-square
solution at each iteration unless all the positive inner products are depleted, which is reflected
by Line 4. As seen from numerical simulations, it has a very low rate of failure in the sense
that it returns $f = 0$ only a few times out of a very large number of tests in which channel
inclusion is present. An illustration of this is given in Figure 2, which shows the rate of
failure of Algorithm 1, with $\beta = 1, 2, 3, 4, 5$ and randomly generated stochastic matrices
$K_1$, $\{R_\alpha, T_\alpha\}_{\alpha=1}^{\beta}$ as well as probability vector
$\{g_{(\alpha)}\}_{\alpha=1}^{\beta}$. Specifically, all matrix and vector entries are generated
according to the uniform distribution in [0, 1] and then normalized to satisfy the probability
constraint. We can also observe that the rates of failure are very close for different values of
$\beta$.




Fig. 2. Rate of failure of Algorithm 1 with randomly generated stochastic matrices $K_1$ and $K_2 \subseteq K_1$ (horizontal axis: $\beta = 1, \ldots, 5$)

Failures occur if the algorithm produces a vector $g_{s_1}$ that has negative entries. It is
natural to ask why Algorithm 1 produces failures. We rule out the selection of a positive inner
product (as reflected in Lines 4 and 5) from being the reason, as justified by the following
theorem.
Theorem 6: In Algorithm 1, the selection of a positive inner product (as reflected in Lines
4 and 5) is necessary for the least-square solution (in Line 7) to be non-negative. Moreover, at
each iteration, vector $P$ always has at least one positive entry, so that a (not yet selected)
column of $A$ having a positive inner product with the residue is always available.

Proof: See Appendix G. ■ 

Theorem 6 implies that no mistake is made by not considering the negative inner products. 
Thus we believe that the failures produced by Algorithm 1 are due to the fact that not all the 
possible selections of inner products are attempted. Going one step further from Algorithm 1, 
it is desirable to establish an improved algorithm which is always successful. We now describe 
the algorithm which can be proved based on a forthcoming conjecture to be always successful 
in solving for sparse probability vector involved in channel inclusion, provided that inclusion is 
present. 



Algorithm 2: The modified OMP algorithm for retrieving a non-negative sparse vector $g$ from
$Ag = h$ with known upper bound of sparsity level $s$ consists of the following
Inputs:

• A $p \times q$ matrix $A$ with $p \ll q$

• A $p \times 1$ vector $h$ which consists of noise-free linear measurements of $g$

• The known upper bound of sparsity level $s$ of the non-negative vector $g$ (in general it is
$p$; for the channel inclusion problem, it is as specified in Theorem 5)

• Tolerance $\epsilon$, for error being essentially zero
Outputs:

• A flag $f$ for a solution being found ($f = 1$) or not found ($f = 0$)

• The number $s_1$ of iterations for the residue to become essentially zero (if $f = 1$)

• A set (vector) $\Lambda_{s_1}$ of column indices for $A$, $\Lambda_{s_1} \subseteq \{1, 2, \ldots, q\}$ (if $f = 1$)

• An $s_1 \times 1$ vector $g_{s_1}$ (if $f = 1$)
Procedure:

Initialize the residue to $r_0 = h$, the set of indices to $\Lambda = 0_{1 \times s}$, the matrix
containing the columns of $A$ which are selected to $A_{\mathrm{sel}} = 0_{p \times s}$, the
inner product matrix to $P = 0_{s \times q}$, and the iteration counter $t = 1$. For observation
purposes we also count the actual number of iterations $t_{\mathrm{act}}$, which is initialized
as zero. The remaining steps are given in pseudo-code as follows:
01: while $1 \le t \le s$ and $\|r_{t-1}\|_\infty > \epsilon$
02:     if $\max\{[P]_{(t,:)}\} \le 0$ and $\min\{[P]_{(t,:)}\} < 0$
03:         $[P]_{(t,:)} = 0_{1 \times q}$; $t = t - 1$;   ▷ resetting inner product and tracing back
04:     else
05:         if $[P]_{(t,:)} = 0_{1 \times q}$
06:             $[P]_{(t,:)} = r_{t-1}^T A$;   ▷ inner product generation
07:         end;
08:         $g_t = -1_{t \times 1}$;   ▷ initializing the sparse vector
09:         while $\min\{g_t\} < 0$ and $\max\{[P]_{(t,:)}\} > 0$
10:             $\lambda_t = \arg\max_j [P]_{(t,j)}$; $\Lambda_{(t)} = \lambda_t$;   ▷ locating the largest inner product
11:             $[A_{\mathrm{sel}}]_{(:,t)} = [A]_{(:,\lambda_t)}$;   ▷ selecting a new column of $A$ corresponding to $\lambda_t$
12:             $g_t = \arg\min_g \|h - [A_{\mathrm{sel}}]_{(:,1:t)} g\|_2^2$;   ▷ solving a least-square problem
13:             $[P]_{(t,\lambda_t)} = -1$;   ▷ marking index $\lambda_t$ as attempted to avoid multiple attempts
14:         end;
15:         if $\min\{g_t\} < 0$
16:             $[P]_{(t,:)} = 0_{1 \times q}$; $t = t - 1$;   ▷ resetting inner product and tracing back
17:         else
18:             $r_t = h - [A_{\mathrm{sel}}]_{(:,1:t)} g_t$; $t = t + 1$;   ▷ updating residue for the next iteration
19:         end;
20:     end;
21:     $t_{\mathrm{act}} = t_{\mathrm{act}} + 1$;
22: end;



Finally, set $f = 1$ if $t > 1$ and $\|r_{t-1}\|_\infty \le \epsilon$, otherwise $f = 0$. With $f = 1$, the other outputs
are $s_1 = t - 1$, $\Lambda_{s_1} = \Lambda(1:s_1)$, and $g_{s_1}$ as given at the termination of the iterations. The $j$-th
entry of $g_{s_1}$ is the $\lambda_j$-th entry of $g$, and all other entries of $g$ are zero.
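
The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `backtracking_nn_omp`, the 0-based column indices, the use of `numpy.linalg.lstsq` for the least-square step, and the use of `eps` as a numerical threshold on inner products and on slightly negative weights are all choices made here for illustration. The counter $t_{act}$ is omitted.

```python
import numpy as np

def backtracking_nn_omp(A, h, s, eps=1e-9):
    """Sketch of an OMP variant with non-negative LS solutions and backtracking.

    Returns (f, support, g): f = 1 on success, support lists the selected
    column indices of A (0-based), and g holds the matching weights.
    """
    A = np.asarray(A, dtype=float)
    h = np.asarray(h, dtype=float)
    q = A.shape[1]
    P = np.zeros((s + 1, q))      # row t: inner products of step t (row 0 unused)
    chosen = [0] * (s + 1)        # chosen[t]: column index selected at step t
    res = {0: h}                  # res[t]: residue after t selected columns
    sol = {0: np.zeros(0)}        # sol[t]: least-square weights of those columns
    t = 1
    while 1 <= t <= s and np.abs(res[t - 1]).max() > eps:
        row = P[t]
        if row.max() <= 0 and row.min() < 0:
            row[:] = 0.0          # all candidates tried: reset and trace back
            t -= 1
            continue
        if not row.any():         # inner product generation
            ip = res[t - 1] @ A
            row[:] = np.where(np.abs(ip) > eps, ip, 0.0)
        g = None
        while row.max() > 0:
            j = int(np.argmax(row))            # largest untried inner product
            chosen[t] = j
            cand, *_ = np.linalg.lstsq(A[:, chosen[1:t + 1]], h, rcond=None)
            row[j] = -1.0                      # mark column j as attempted
            if cand.min() >= -eps:             # accept only non-negative weights
                g = cand
                break
        if g is None:
            row[:] = 0.0          # no admissible column: reset and trace back
            t -= 1
        else:
            sol[t] = g
            res[t] = h - A[:, chosen[1:t + 1]] @ g
            t += 1
    if t > 1 and np.abs(res[t - 1]).max() <= eps:
        return 1, chosen[1:t], sol[t - 1]
    return 0, [], np.zeros(0)

# Toy usage: h is a conic combination of the columns of A.
A_demo = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]])
h_demo = np.array([0.5, 0.5])
print(backtracking_nn_omp(A_demo, h_demo, s=2))
```

Keeping one row of `P` per step is what lets the search resume at step $t - 1$ with the untried candidates still available after a backtrack.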



Algorithm 2 differs from Algorithm 1 primarily in two aspects. First, the inner product
is changed from a vector into a matrix, as reflected in Line 6, for the purpose of recalling
the values of inner products involved in past iterations. Second, the iteration may go
backward, as reflected by Lines 3 and 16, in the sense that the most recently added columns
of $A_{sel}$ may be deleted in order to "backtrack". In Algorithm 2, the iteration proceeds at $t$
when a new column of $A$ can be found such that, with $[A_{sel}]_{(:,t)}$ updated as this new column,
it follows that $g_t = \arg\min_g \|h - [A_{sel}]_{(:,1:t)} g\|_2 \ge 0$, i.e. the least-square solution of the
sparse vector is non-negative in each entry; otherwise, the iteration traces back and updates
the selection of $[A_{sel}]_{(:,t-1)}$, for the purpose of making it possible to find $[A_{sel}]_{(:,t)}$ such that
$g_t = \arg\min_g \|h - [A_{sel}]_{(:,1:t)} g\|_2 \ge 0$. When the iteration proceeds, the residue is updated for
inner product generation in the next iteration; when the iteration traces back, the inner product
is reset, in order to enable its re-generation when the iteration proceeds to this step a second
time.

We now introduce the following conjecture which will lead to the effectiveness (to be proved 
in Theorem 7) of Algorithm 2. 

Conjecture 1: Let $G$ be a matrix with all entries being non-negative and all columns being
linearly independent. There exists at least one column $g_*$ of $G$ such that, with $G_*$ obtained by
excluding $g_*$ from $G$, $x := \arg\min_x \|g_* - G_* x\|_2$ has non-negative entries.

Conjecture 1 points out that among several linearly independent non-negative vectors, there 
is at least one of them, whose orthogonal projection onto the hyperplane defined by the other 
vectors is a conic combination of those vectors. In the following, we show the effectiveness of 
Algorithm 2, as stated in Theorem 7. 
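
As a small numerical illustration of what Conjecture 1 asserts (it is not evidence for the conjecture in general), the sketch below lists, for a given matrix, the columns whose projection onto the span of the others needs only non-negative coefficients. The function name, the tolerance `tol`, and the example matrix are assumptions made here.

```python
import numpy as np

def columns_satisfying_conjecture(G, tol=1e-12):
    """Indices i such that the least-square fit of column i of G by the
    remaining columns has only non-negative coefficients."""
    G = np.asarray(G, dtype=float)
    good = []
    for i in range(G.shape[1]):
        rest = np.delete(G, i, axis=1)          # G with column i excluded
        x, *_ = np.linalg.lstsq(rest, G[:, i], rcond=None)
        if x.min() >= -tol:
            good.append(i)
    return good

# Non-negative matrix with linearly independent columns (illustrative data).
G_demo = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [0, 0, 1]])
print(columns_satisfying_conjecture(G_demo))
```

For this example only the middle column qualifies, which is enough for the conjecture to hold for this particular matrix.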

Theorem 7: If Conjecture 1 holds, then Algorithm 2 does not fail, i.e. $f = 1$ is returned when
inclusion is present.

Proof: See Appendix H. ■ 
For Algorithm 2 to fail, $A_{sel}$ must have no column of $A$, and all the columns of $A$ have
been attempted but none of them is selected eventually. These possible multiple attempts all
occur at $t = 1$, when $A_{sel}$ has no column of $A$. Theorem 7 effectively rules out this possibility,
and implies that Algorithm 2 is guaranteed to work by searching for a non-negative least-square
solution at each iteration, in the sense that there exists a path of iterations, in which
an atom (a column of $A$) associated with a positive inner product is selected at each iteration,
eventually leading to a solution with all entries of $g_{s_1}$ being non-negative. Essentially, Theorem
7 implies that by only focusing on the selection of a new column which results in a non-negative
intermediate solution $g_t$ (as reflected in Lines 9 and 15 of Algorithm 2), we do not run the
risk of driving Algorithm 2 into failure. If Algorithm 1 or 2 terminates with $f = 1$, the residue
can be treated as zero. From this, it can be shown that $g_{s_1}$ is a probability vector: each column
of $A$, as well as $h$, stacks the entries of an $n_2 \times m_2$ stochastic matrix, so its entries sum to $n_2$.
Considering the product $1_{1 \times n_2 m_2}([A_{sel}]_{(:,1:s_1)} g_{s_1} - h) = 0$, we have $n_2 1_{1 \times s_1} g_{s_1} = n_2$, which shows
that the entries of $g_{s_1}$ sum up to 1, i.e. a sparse probability vector relating $K_1$ and $K_2$ is obtained.

By performing the same numerical tests (i.e. for both $K_1$ and $K_2$ being $3 \times 3$ or $4 \times 3$, with
randomly generated stochastic matrices $K_1$, $\{R_\alpha, T_\alpha\}_{\alpha=1}^{\beta}$ as well as probability vector $\{g_\alpha\}_{\alpha=1}^{\beta}$)
as performed on Algorithm 1, it is observed that Algorithm 2 produces no failure in 5 x 10^ 
tests for each case. It is also seen that Algorithm 2 does not invoke many backtracks in practice 
if inclusion is present, which is as expected given the fact that Algorithm 1 has a very low rate 
of failure. 

Furthermore, starting from two given DMCs $K_1$ and $K_2$ without knowing the presence or
absence of inclusion, for the purpose of determining if inclusion is present, the $\ell_1$ minimization
approach given by (5) should be used, since it provides guaranteed correctness about the presence
or absence of inclusion. Once the presence of inclusion is identified, for the purpose of obtaining
a sparse probability vector relating $K_1$ and $K_2$, Algorithm 1 can be used first, and if Algorithm
1 does not return a sparse probability vector as desired, Algorithm 2 becomes the choice for this
purpose. Although we do not have a proof that Algorithm 2 does not incur a lot of backtracking,
we know empirically that this is the case, and thus Algorithm 2 is favorable in the sense that
it is a more effective and less complex approach for obtaining a sparse solution than the $\ell_1$
minimization approach.

VI. Conclusions 

In this paper, we investigate the characterization of channel inclusion between DMCs through 
analytical and numerical approaches. We have established several conditions for equivalence 
between DMCs, and for inclusion between DMCs with structure including doubly stochastic, 
circulant, and symmetric DMCs. We formulate a linear programming problem that quantifies
how far one DMC is from including another, which has implications for the comparison
of their error rate performance. In addition, for the case in which one DMC
includes another, by using Caratheodory's theorem, we derive an upper bound for the necessary 
number of pairs of pure channels involved in the representation of the worse DMC in terms of 
the better one, which is significantly less than the maximum possible number of such pairs. This 
kind of sparsity implies reduced complexity of finding the optimal code for the better DMC 
based on the code for the worse one. By modifying the standard OMP algorithm, an iterative 
algorithm that exploits this sparsity is established, which is seen to be significantly less complex
than basis pursuit and produces no failure in determining the presence or absence of channel 
inclusion. Such effectiveness in determining the presence or absence of channel inclusion is 
proved with the help of a conjecture. 

Appendix A. Proof of Theorem 1 

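Given (1), it follows that

$$\Big(\sum_{\alpha=1}^{\beta} g(\alpha) R_\alpha K_1 T_\alpha\Big)^{\otimes N} = K_2^{\otimes N} \qquad (9)$$

Based on the bilinearity of the Kronecker product, the left hand side of (9) can be expanded into a
summation of the form $\sum_{j:\{1,\ldots,N\}\to\{1,\ldots,\beta\}} \big(\prod_{i=1}^{N} g(j(i))\big)\big[(R_{j(1)} K_1 T_{j(1)}) \otimes \cdots \otimes (R_{j(N)} K_1 T_{j(N)})\big]$,
where the summation is over all possible functions $j: \{1, \ldots, N\} \to \{1, \ldots, \beta\}$.
Based on the mixed-product property of the Kronecker product, we have

$$(R_{j(1)} K_1 T_{j(1)}) \otimes (R_{j(2)} K_1 T_{j(2)}) = \big[(R_{j(1)} K_1) \otimes (R_{j(2)} K_1)\big](T_{j(1)} \otimes T_{j(2)}) \qquad (10)$$

By applying (10) repeatedly, it follows that
$\big(\prod_{i=1}^{N} g(j(i))\big)\big[(R_{j(1)} K_1 T_{j(1)}) \otimes \cdots \otimes (R_{j(N)} K_1 T_{j(N)})\big] = \big(\prod_{i=1}^{N} g(j(i))\big)(R_{j(1)} \otimes \cdots \otimes R_{j(N)}) K_1^{\otimes N} (T_{j(1)} \otimes \cdots \otimes T_{j(N)})$,
which in turn implies that the left hand side of (9) expands into terms of the form
$\big(\prod_{i=1}^{N} g(j(i))\big)(R_{j(1)} \otimes \cdots \otimes R_{j(N)}) K_1^{\otimes N} (T_{j(1)} \otimes \cdots \otimes T_{j(N)})$,
and thus $K_2^{\otimes N} \subseteq K_1^{\otimes N}$ by Definition 1.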
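
The mixed-product step used above can be checked on a small example. The sketch below is only an illustration: the helper names `matmul` and `kron` and the particular matrices are assumptions made here, and exact rational arithmetic is used so that the two sides can be compared with `==`.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def kron(X, Y):
    """Kronecker product of two lists of rows."""
    return [[x * y for x in xr for y in yr] for xr in X for yr in Y]

# A 2 x 2 stochastic matrix and two pairs of pure (0, 1) channels (illustrative data).
K1 = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]
Ra, Ta = [[1, 0], [1, 0]], [[1, 0], [0, 1]]
Rb, Tb = [[0, 1], [1, 0]], [[0, 1], [0, 1]]

lhs = kron(matmul(matmul(Ra, K1), Ta), matmul(matmul(Rb, K1), Tb))
rhs = matmul(matmul(kron(Ra, Rb), kron(K1, K1)), kron(Ta, Tb))
print(lhs == rhs)
```

The Kronecker products of the pure channels on the right hand side are again (0, 1) stochastic matrices, which is what allows the expansion to be read as an instance of Definition 1.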

Appendix B. Proof of Theorem 2 
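Consider $K_1$ of size $n_1 \times m_1$ and $K_2$ of size $n_2 \times m_2$. By Definition 1 we have

$$K_1 = \sum_{\alpha_1=1}^{\beta_1} g_1(\alpha_1) R_{1,\alpha_1} K_2 T_{1,\alpha_1} \qquad (11a)$$

$$K_2 = \sum_{\alpha_2=1}^{\beta_2} g_2(\alpha_2) R_{2,\alpha_2} K_1 T_{2,\alpha_2} \qquad (11b)$$

with $R_{1,\alpha_1}$'s of size $n_1 \times n_2$, $T_{1,\alpha_1}$'s of size $m_2 \times m_1$, $R_{2,\alpha_2}$'s of size $n_2 \times n_1$, $T_{2,\alpha_2}$'s of size
$m_1 \times m_2$, all of which are "pure" DMCs. By plugging (11b) into (11a), it can be seen that

$$K_1 = \sum_{\alpha_1=1,\alpha_2=1}^{\beta_1,\beta_2} g_1(\alpha_1) g_2(\alpha_2) R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1} \qquad (12)$$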


is expressed as a convex combination involving the terms $R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}$. We first
establish the following lemma as an intermediate step.

Lemma 1: There should be only one term of the form $R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}$ on the right
hand side of (12), i.e. $\beta_1 = \beta_2 = 1$, with full-rank $R_{1,\alpha_1} R_{2,\alpha_2}$ and $T_{2,\alpha_2} T_{1,\alpha_1}$.

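Proof: Let $C_1$ be the capacity of $K_1$, and $C_{\alpha_1,\alpha_2}$ be the capacity of $R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}$. Let
$I(K, p)$ denote the mutual information of DMC $K$ with the input distribution represented by
row vector $p$. Let $p^*$ be the capacity-achieving input distribution of $K_1$. We also denote this
distribution in terms of the probability mass function (PMF) $p^*(x)$ of $x = 1, \ldots, n_1$ as needed.
Denote the entry of $R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}$ with index $(x, y)$ by $P_{\alpha_1,\alpha_2}(y|x)$, considering that
they describe transition probabilities. Based on [13, Theorem 2.7.4], which states that mutual
information is convex in the transition probabilities for a fixed input distribution, we have

$$I(K_1, p^*) \le \sum_{\alpha_1=1,\alpha_2=1}^{\beta_1,\beta_2} g_1(\alpha_1) g_2(\alpha_2) I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) \qquad (13)$$

Note that $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) \le C_{\alpha_1,\alpha_2}$, and $C_{\alpha_1,\alpha_2} \le C_1 = I(K_1, p^*)$. It is clear that
if $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) < I(K_1, p^*)$ for any $\{\alpha_1, \alpha_2\}$, it will follow from (13) that
$C_1 < C_1$, which is contradictory. Therefore, it is required that $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) = I(K_1, p^*)$
for all $\{\alpha_1, \alpha_2\}$. In what follows, we show that $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) < I(K_1, p^*)$
holds for the cases in which $R_{1,\alpha_1} R_{2,\alpha_2}$ or $T_{2,\alpha_2} T_{1,\alpha_1}$ is not full-rank, thereby ruling
them out.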

We first consider what happens if $R_{1,\alpha_1} R_{2,\alpha_2}$ is not full-rank, by comparing $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1, p^*)$
with $I(K_1, p^*)$. Given the formula [13, eq. (2.111)] of mutual information

$$I(X;Y) = H(Y) - \sum_{x} p(x) H(Y|X = x) \qquad (14)$$

it is easy to see that $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1, p^*) = I(K_1, p^* R_{1,\alpha_1} R_{2,\alpha_2})$, since $R_{1,\alpha_1} R_{2,\alpha_2} K_1$ with
input distribution $p^*$ and $K_1$ with input distribution $p^* R_{1,\alpha_1} R_{2,\alpha_2}$ result in the same output ($Y$)
distribution, as well as the same row entropy ($H(Y|X = x)$) distribution. With $R_{1,\alpha_1} R_{2,\alpha_2}$ being
not full-rank, there should be at least one zero entry in the probability vector $p^* R_{1,\alpha_1} R_{2,\alpha_2}$, and
$p^* R_{1,\alpha_1} R_{2,\alpha_2}$ cannot be a capacity-achieving distribution for $K_1$, given assumption (I). On the
other hand, based on the data processing inequality [13, Th. 2.8.1], we have
$I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) \le I(R_{1,\alpha_1} R_{2,\alpha_2} K_1, p^*)$. Consequently,
$I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) \le I(R_{1,\alpha_1} R_{2,\alpha_2} K_1, p^*) = I(K_1, p^* R_{1,\alpha_1} R_{2,\alpha_2}) < I(K_1, p^*)$,
which leads to a contradiction as discussed above, and thus
$R_{1,\alpha_1} R_{2,\alpha_2}$ must be full-rank for all $\{\alpha_1, \alpha_2\}$.
Second, we show that with $R_{1,\alpha_1} R_{2,\alpha_2}$ being full-rank, $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) < I(K_1, p^*)$
holds if $T_{2,\alpha_2} T_{1,\alpha_1}$ is not full-rank, by comparing $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*)$ with
$I(R_{1,\alpha_1} R_{2,\alpha_2} K_1, p^*)$. We use $p(y|x)$ and $p'(y|x)$ to denote the entries of $R_{1,\alpha_1} R_{2,\alpha_2} K_1$ and
$R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}$ with index $(x, y)$ respectively, considering that they describe transition
probabilities. It is clear that with $T_{2,\alpha_2} T_{1,\alpha_1}$ being a full-rank (and thus a permutation) matrix,
$I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) = I(R_{1,\alpha_1} R_{2,\alpha_2} K_1, p^*)$. We just consider a representative case
of $T_{2,\alpha_2} T_{1,\alpha_1}$ being not full-rank: $T_{2,\alpha_2} T_{1,\alpha_1}$ is obtained from switching the 1 entry at index $(y_1, y_1)$
with the entry at index $(y_1, y_2)$ in the $m_1 \times m_1$ identity matrix. This results in the relation
between $p(y|x)$ and $p'(y|x)$ (for all $x = 1, \ldots, n_1$) given by: $p'(y_2|x) = p(y_1|x) + p(y_2|x)$,
$p'(y_1|x) = 0$, and $p'(y|x) = p(y|x)$ for all other values of $y$ from 1 through $m_1$. Based on the log
sum inequality [13, Th. 2.7.1], for all $x = 1, \ldots, n_1$ we have

$$p'(y_2|x) p^*(x) \log \frac{p'(y_2|x)}{\sum_{x'=1}^{n_1} p'(y_2|x') p^*(x')} \le p(y_2|x) p^*(x) \log \frac{p(y_2|x)}{\sum_{x'=1}^{n_1} p(y_2|x') p^*(x')} + p(y_1|x) p^*(x) \log \frac{p(y_1|x)}{\sum_{x'=1}^{n_1} p(y_1|x') p^*(x')} \qquad (15)$$

and consequently

$$\sum_{x=1}^{n_1} p'(y_2|x) p^*(x) \log \frac{p'(y_2|x)}{\sum_{x'=1}^{n_1} p'(y_2|x') p^*(x')} \le \sum_{x=1}^{n_1} \left[ p(y_2|x) p^*(x) \log \frac{p(y_2|x)}{\sum_{x'=1}^{n_1} p(y_2|x') p^*(x')} + p(y_1|x) p^*(x) \log \frac{p(y_1|x)}{\sum_{x'=1}^{n_1} p(y_1|x') p^*(x')} \right] \qquad (16)$$

Note that the left hand side of (16) makes part of $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*)$, and the right
hand side of (16) makes part of $I(R_{1,\alpha_1} R_{2,\alpha_2} K_1, p^*)$, and the remaining terms in the two mutual
informations are the same since there is no change made on the output symbols other than $y_1$
and $y_2$, and consequently

$$I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) \le I(R_{1,\alpha_1} R_{2,\alpha_2} K_1, p^*) \qquad (17)$$

It is clear that for the equality to hold in (17), the equality needs to hold in (15) for $x = 1, \ldots, n_1$.
Given assumption (I), which specifies that $p^*(x) > 0$ for $x = 1, \ldots, n_1$, it follows that the
equality holds in (17) only when $p(y_2|x)/p(y_1|x)$ is constant for $x = 1, \ldots, n_1$, or one of $p(y_1|x)$
and $p(y_2|x)$ is zero for $x = 1, \ldots, n_1$. This leads to the requirement that $K_1$ has a column which
is a multiple of another column, or an all-zero column, thereby contradicting assumption (II).
Therefore with $T_{2,\alpha_2} T_{1,\alpha_1}$ being not full-rank, strict inequality holds in (17), which in turn leads to
$I(R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}, p^*) < I(K_1, p^*)$ and hence $C_1 < C_1$, which is contradictory. We have now
completed the proof that $R_{1,\alpha_1} R_{2,\alpha_2}$ and $T_{2,\alpha_2} T_{1,\alpha_1}$ need to be full-rank (and consequently
permutation matrices) for all $\{\alpha_1, \alpha_2\}$.
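
The effect of the representative non-full-rank case, in which one output symbol is merged into another, can be seen numerically. The following sketch is only an illustration with made-up numbers; the function names and the example channel are assumptions made here, and mutual information is measured in bits.

```python
from math import log2

def mutual_information(K, p):
    """I(K, p) in bits for a row-stochastic matrix K and input distribution p."""
    out = [sum(p[x] * K[x][y] for x in range(len(p))) for y in range(len(K[0]))]
    return sum(p[x] * K[x][y] * log2(K[x][y] / out[y])
               for x in range(len(p)) for y in range(len(out)) if K[x][y] > 0)

def merge_outputs(K, y1, y2):
    """Move all probability mass of output y1 onto output y2."""
    merged = [row[:] for row in K]
    for row in merged:
        row[y2] += row[y1]
        row[y1] = 0.0
    return merged

K_demo = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
p_demo = [0.5, 0.5]
before = mutual_information(K_demo, p_demo)
after = mutual_information(merge_outputs(K_demo, 0, 1), p_demo)
print(before, after)
```

In this example the two merged columns are not proportional, so the mutual information strictly decreases, in line with the equality condition discussed for (17).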

Following from the above conclusion, we consider what is further required for equality to
hold in (13), based on the log sum inequality [13, Th. 2.7.1]. It follows easily from this inequality
that, for any $x = 1, \ldots, n_1$ and $y = 1, \ldots, m_1$,

$$\Big(\sum_{\alpha_1=1,\alpha_2=1}^{\beta_1,\beta_2} g_1(\alpha_1) g_2(\alpha_2) P_{\alpha_1,\alpha_2}(y|x) p^*(x)\Big) \log \frac{\sum_{\alpha_1=1,\alpha_2=1}^{\beta_1,\beta_2} g_1(\alpha_1) g_2(\alpha_2) P_{\alpha_1,\alpha_2}(y|x)}{\sum_{x'=1}^{n_1} \sum_{\alpha_1=1,\alpha_2=1}^{\beta_1,\beta_2} g_1(\alpha_1) g_2(\alpha_2) P_{\alpha_1,\alpha_2}(y|x') p^*(x')}$$
$$\le \sum_{\alpha_1=1,\alpha_2=1}^{\beta_1,\beta_2} g_1(\alpha_1) g_2(\alpha_2) P_{\alpha_1,\alpha_2}(y|x) p^*(x) \log \frac{P_{\alpha_1,\alpha_2}(y|x)}{\sum_{x'=1}^{n_1} P_{\alpha_1,\alpha_2}(y|x') p^*(x')} \qquad (18)$$

It is clear that the summation of (18) over $x = 1, \ldots, n_1$ and $y = 1, \ldots, m_1$ leads to (13).
Therefore, for the equality to hold in (13), it is required that the equality holds in (18) for
all $x = 1, \ldots, n_1$ and $y = 1, \ldots, m_1$, which is satisfied only when the $y$-th column of one
$R_{1,\alpha_1} R_{2,\alpha_2} K_1 T_{2,\alpha_2} T_{1,\alpha_1}$ term is a multiple of the $y$-th column of another such term, for all
$y = 1, \ldots, m_1$. This in turn requires that "different" such terms must be related through diagonal
matrices, e.g. it is required that

$$R_{1,1} R_{2,1} K_1 T_{2,1} T_{1,1} = R_{1,1} R_{2,2} K_1 T_{2,2} T_{1,1} D \qquad (19)$$

with $D$ being a diagonal matrix with the diagonal entries being positive. Considering that
$R_{1,1} R_{2,1}$, $T_{2,1} T_{1,1}$, $R_{1,1} R_{2,2}$ and $T_{2,2} T_{1,1}$ are permutation matrices, it follows that $K_1 = P_1 K_1 P_2 D$
with $P_1$, $P_2$ being permutation matrices. Given assumption (III), it is required that both $P_1$ and
$P_2$ are identity matrices, and also required that $D$ is identity, and $\{R_{1,\alpha_1} R_{2,\alpha_2}, T_{2,\alpha_2} T_{1,\alpha_1}\}$ are
the same for all $\{\alpha_1, \alpha_2\}$. Consequently, there should be only one term in the right hand side
of (12). Thus we have proved that $\beta_1 = \beta_2 = 1$, which in turn implies that we can simplify the
notation through $R_{1,\alpha_1} = R_1$, $R_{2,\alpha_2} = R_2$, $T_{1,\alpha_1} = T_1$, $T_{2,\alpha_2} = T_2$. ■
Now that we have established $\mathrm{rank}(R_1 R_2) = n_1$ and $\mathrm{rank}(T_2 T_1) = m_1$, we consider what
implications they have on $R_1$, $R_2$, $T_1$, $T_2$. Given the fact that $\mathrm{rank}(AB) \le \min\{\mathrm{rank}(A), \mathrm{rank}(B)\}$,
it is further implied that

$$\mathrm{rank}(R_1) \ge n_1, \quad \mathrm{rank}(R_2) \ge n_1, \quad \mathrm{rank}(T_1) \ge m_1, \quad \mathrm{rank}(T_2) \ge m_1 \qquad (20)$$

Similarly, by substituting (11a) into (11b), it can be derived that

$$\mathrm{rank}(R_1) \ge n_2, \quad \mathrm{rank}(R_2) \ge n_2, \quad \mathrm{rank}(T_1) \ge m_2, \quad \mathrm{rank}(T_2) \ge m_2 \qquad (21)$$

On the other hand, for $R_1$ of size $n_1 \times n_2$, $R_2$ of size $n_2 \times n_1$, $T_1$ of size $m_2 \times m_1$, $T_2$ of size
$m_1 \times m_2$, the ranks should satisfy

$$\mathrm{rank}(R_1) \le \min(n_1, n_2), \quad \mathrm{rank}(R_2) \le \min(n_1, n_2), \quad \mathrm{rank}(T_1) \le \min(m_1, m_2), \quad \mathrm{rank}(T_2) \le \min(m_1, m_2) \qquad (22)$$

Given (20), (21) and (22), it follows that $\mathrm{rank}(R_1) = \mathrm{rank}(R_2) = n_1 = n_2$ and $\mathrm{rank}(T_1) =
\mathrm{rank}(T_2) = m_1 = m_2$. Since square full-rank (0, 1) matrices are permutation matrices, these four
matrices must be permutation matrices, and in turn it is necessary to have $K_2 = R K_1 T$ with
$R$ and $T$ being permutation matrices for $K_1$ and $K_2$ to be equivalent. It is easy to see that this
condition is also sufficient for the equivalence between $K_1$ and $K_2$, and the proof is complete.

Appendix C. Proof of Theorem 3 

We start from (1) with $R_\alpha$'s and $T_\alpha$'s being pure channels, as equivalent to Definition 1. It
is clear that entries of $K_2$ are linear combinations of the entries of $K_1$. Considering the fact
that there is a one-to-one mapping between the entries of $K_1$ and the entries of $w_1$, as well
as the same situation for $K_2$ and $w_2$, it follows that there is a matrix $P$ such that $w_2 = P w_1$,
and the conditions for $K_2 \subseteq K_1$ can be related to what properties the combining coefficients
$[P]_{(i,j)}$'s have. Based on Birkhoff's Theorem [19, p. 30], both $K_1$ and $K_2$ are inside the convex
hull of $n \times n$ permutation matrices, therefore it is sufficient for $R_\alpha$'s and $T_\alpha$'s to contain only
permutation matrices; otherwise $\sum_{\alpha=1}^{\beta} g(\alpha) R_\alpha K_1 T_\alpha$ will fall out of the convex hull of $n \times n$
permutation matrices, which contradicts the doubly stochastic assumption. Consequently,
for each $\alpha$, $R_\alpha K_1 T_\alpha$ gives a matrix having exactly the same set of entries as $K_1$, generated by
permuting the columns and rows of $K_1$. As a result, $R_\alpha$'s and $T_\alpha$'s do not replace any row of $K_1$
with the duplicate of another row, or merge any column into another column and then replace
it with zeros.

We now consider the properties of $[P]_{(i,j)}$'s based on the structure of $R_\alpha K_1 T_\alpha$. We have
$\sum_i [P]_{(i,j)} = \sum_{\alpha=1}^{\beta} g(\alpha) = 1$ for $j = 1, \ldots, n^2$, since each entry of $K_1$ is contained in $R_\alpha K_1 T_\alpha$
exactly once, for $\alpha = 1, \ldots, \beta$. On the other hand, since each entry $[K_2]_{(i,j)}$ of $K_2$ is the convex
combination of the entries with the same index $(i, j)$ of the $R_\alpha K_1 T_\alpha$'s, while each entry of $R_\alpha K_1 T_\alpha$
is exactly an entry of $K_1$, it follows that $\sum_j [P]_{(i,j)} = \sum_{\alpha=1}^{\beta} g(\alpha) = 1$ for $i = 1, \ldots, n^2$. Also,
it is straightforward to see that $[P]_{(i,j)} \ge 0$ for $i, j = 1, \ldots, n^2$ due to the non-negativeness of the
$g(\alpha)$'s. Consequently, $P$ is doubly stochastic, and $w_2 = P w_1$ implies that $w_2 \prec w_1$ [19, p. 155],
completing the proof.
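
The conclusion $w_2 \prec w_1$ can be illustrated on a small doubly stochastic example. In the sketch below the helper `majorizes`, the matrices, and the choice of a two-term convex combination of row permutations are assumptions made here for illustration; exact fractions avoid rounding issues in the partial-sum comparison.

```python
from fractions import Fraction as F

def majorizes(w1, w2):
    """True if w2 is majorized by w1, i.e. the ordered partial sums of w1
    dominate those of w2 and the totals agree."""
    a, b = sorted(w1, reverse=True), sorted(w2, reverse=True)
    if len(a) != len(b) or sum(a) != sum(b):
        return False
    sa = sb = 0
    for x, y in zip(a, b):
        sa, sb = sa + x, sb + y
        if sa < sb:
            return False
    return True

# A doubly stochastic K1 and K2 = (K1 + row-permuted K1) / 2 (illustrative data).
K1 = [[F(1, 2), F(1, 3), F(1, 6)],
      [F(1, 3), F(1, 6), F(1, 2)],
      [F(1, 6), F(1, 2), F(1, 3)]]
swapped = [K1[1], K1[0], K1[2]]          # K1 with its first two rows exchanged
K2 = [[(a + b) / 2 for a, b in zip(r, s)] for r, s in zip(K1, swapped)]

w1 = [e for row in K1 for e in row]
w2 = [e for row in K2 for e in row]
print(majorizes(w1, w2), majorizes(w2, w1))
```

Here $K_2$ is included in $K_1$ by construction, and the entry vector of $K_2$ is majorized by that of $K_1$ but not the other way around.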

Appendix D. Proof of Theorem 4 

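Let $w_1$ and $w_2$ be the $n^2 \times 1$ vectors containing all the entries of $K_1$ and $K_2$ respectively. It is
easy to see that $w_1$ and $w_2$ contain the entries of $v_1$ and $v_2$, each duplicated $n$ times, respectively.
Given $K_2 \subseteq K_1$, based on Theorem 3 we know that $w_2 \prec w_1$. Since every entry is duplicated $n$
times, the partial sums of the ordered entries of $w_1$ and $w_2$ taken over $kn$ terms are $n$ times the
partial sums of the ordered entries of $v_1$ and $v_2$ taken over $k$ terms, so the partial-sum inequalities
defining $w_2 \prec w_1$ yield those defining $v_2 \prec v_1$ for $k = 1, \ldots, n$. In addition,
$\sum_{i=1}^{n} v_1(i) = \sum_{i=1}^{n} v_2(i) = 1$ as required for stochastic matrices, therefore $v_2 \prec v_1$ is necessary
for $K_2 \subseteq K_1$.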

We next prove that the existence of a probability vector $x \in \mathbb{R}^n$ such that $v_1 \circledast x = v_2$ is
sufficient for $K_2 \subseteq K_1$. Let $P$ be the $n \times n$ permutation matrix such that $xP$ is cyclic shifted
to the right by 1 with respect to $x$, and let $X$ be the $n \times n$ matrix with the $i$-th column being
$(x P^{i-1})^T$. It is easy to see that both $P^{i-1}$ and $X$ are circulant. Given $v_1 \circledast x = v_2$, it follows that
$v_1 X = v_2$ due to the definition of circular convolution. Also, notice that $P^{i-1} X = X P^{i-1}$ since
the multiplication of two circulant matrices is commutative. Consequently, the $i$-th row of $K_1$,
given by $v_1 P^{i-1}$, and the $i$-th row of $K_2$, given by $v_2 P^{i-1}$, are related through $(v_1 P^{i-1}) X =
v_1 X P^{i-1} = v_2 P^{i-1}$. It then follows that $K_1 X = K_2$ with a stochastic matrix $X$, i.e. Definition
1 is satisfied, and the proof is complete.

An alternative proof based on FFT: Let $U$ be the $n \times n$ FFT matrix. Then $v_1 \circledast x = v_2$
$\Leftrightarrow \mathrm{FFT}(v_1) \circ \mathrm{FFT}(x) = \mathrm{FFT}(v_2) \Leftrightarrow \mathrm{diag}(\mathrm{FFT}(v_1))\,\mathrm{diag}(\mathrm{FFT}(x)) = \mathrm{diag}(\mathrm{FFT}(v_2))$
$\Leftrightarrow U \mathrm{diag}(\mathrm{FFT}(v_1)) U^* U \mathrm{diag}(\mathrm{FFT}(x)) U^* = U \mathrm{diag}(\mathrm{FFT}(v_2)) U^* \Leftrightarrow K_1 X = K_2$, and the proof
is complete.
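
The sufficiency argument can be checked on a small case. The sketch below is an illustration under assumptions made here: circulant matrices are built with the $i$-th row equal to the first row cyclically shifted to the right by $i$ positions (0-based), circular convolution is taken in its usual form, and the vectors are made-up examples.

```python
from fractions import Fraction as F

def circulant(v):
    """Matrix whose i-th row is v cyclically shifted to the right by i."""
    n = len(v)
    return [[v[(j - i) % n] for j in range(n)] for i in range(n)]

def circ_conv(v, x):
    """Circular convolution of two length-n sequences."""
    n = len(v)
    return [sum(v[j] * x[(k - j) % n] for j in range(n)) for k in range(n)]

def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

v1 = [F(1, 2), F(1, 3), F(1, 6)]       # first row of a circulant K1
x = [F(3, 4), F(1, 4), F(0)]           # probability vector
v2 = circ_conv(v1, x)                  # first row of K2

K1, X, K2 = circulant(v1), circulant(x), circulant(v2)
print(matmul(K1, X) == K2)
```

Since `X` is a circulant matrix built from a probability vector, each of its rows sums to one, so multiplying by it is an output randomization in the sense of Definition 1.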
Appendix E. Proof of Corollary 1

It is clear that $3 \times 3$ symmetric DMCs which are not circulant can have only the following
layout:

$$\begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \\ 3 & 1 & 2 \end{bmatrix} \qquad (23)$$

and it can be made circulant by permuting the second and third rows. Also, $4 \times 4$ symmetric
DMCs which are not circulant can have only the following layouts:

$$\begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \\ 3 & 4 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix}, \quad \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 4 & 3 \\ 3 & 4 & 2 & 1 \\ 4 & 3 & 1 & 2 \end{bmatrix}, \quad \begin{bmatrix} 1 & 2 & 3 & 4 \\ 3 & 1 & 4 & 2 \\ 2 & 4 & 1 & 3 \\ 4 & 3 & 2 & 1 \end{bmatrix} \qquad (24)$$

together with other layouts obtained by permuting their rows. For each of these layouts, it is easy
to check with MATLAB that there exist column permutations which can make each of its rows
a cyclic shift of the others. Therefore for $n = 3, 4$, $n \times n$ symmetric DMCs can be transformed
into circulant DMCs. Consequently, the results in Theorem 4 can be applied to circulant DMCs,
and the second statement of the corollary holds.



Appendix F. Proof of Theorem 5 

Since an $n_2 \times m_2$ stochastic matrix is determined by its $n_2$ rows and first $m_2 - 1$ columns,
the class of all $n_2 \times m_2$ stochastic matrices can be viewed as a convex polytope in $n_2(m_2 - 1)$
dimensions. We apply Carathéodory's theorem [16, p. 155], which asserts that if a subset $\mathcal{S}$ of
$\mathbb{R}^n$ is $k$-dimensional, then every vector in the convex hull of $\mathcal{S}$ can be expressed as a convex
combination of at most $k + 1$ vectors in $\mathcal{S}$, on (1) with $K_1$ of size $n_1 \times m_1$ and $K_2$ of size
$n_2 \times m_2$. It is clear that $R_\alpha K_1 T_\alpha$ and $K_2$ are at most $n_2(m_2 - 1)$-dimensional. Therefore if $K_2$
is in the convex hull of $\{R_\alpha K_1 T_\alpha\}_{\alpha=1}^{\beta}$, it can be expressed as a convex combination of at most
$n_2(m_2 - 1) + 1$ matrices in $\{R_\alpha K_1 T_\alpha\}_{\alpha=1}^{\beta}$, i.e. the number of necessary $\{R_\alpha, T_\alpha\}$ pairs can be
bounded as $\beta = \beta_1 \le n_2(m_2 - 1) + 1$ if (1) holds. A similar proof can follow for the case of
both $K_1$ and $K_2$ being $n \times n$ doubly stochastic, in which they are at most $(n - 1)^2$-dimensional.
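
To give a sense of scale for this bound, the sketch below compares it with the total number of pure-channel pairs. The count used for the pairs is an assumption made here: it takes each $R_\alpha$ to be an $n_2 \times n_1$ (0, 1) stochastic matrix and each $T_\alpha$ to be an $m_1 \times m_2$ (0, 1) stochastic matrix, so that each row independently selects one column. The function names are hypothetical.

```python
def pure_pair_count(n1, m1, n2, m2):
    """Number of (R, T) pairs of pure channels, with R of size n2 x n1 and
    T of size m1 x m2, each row holding a single 1."""
    return (n1 ** n2) * (m2 ** m1)

def caratheodory_bound(n2, m2):
    """Bound n2 * (m2 - 1) + 1 on the number of pairs needed when inclusion holds."""
    return n2 * (m2 - 1) + 1

def caratheodory_bound_doubly_stochastic(n):
    """Analogous bound (n - 1)^2 + 1 for n x n doubly stochastic matrices."""
    return (n - 1) ** 2 + 1

# Both channels 3 x 3: all possible pairs versus the number that suffices.
print(pure_pair_count(3, 3, 3, 3), caratheodory_bound(3, 3),
      caratheodory_bound_doubly_stochastic(3))
```

For two $3 \times 3$ channels this gives 729 possible pairs against a bound of 7 in the general case and 5 in the doubly stochastic case.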
Appendix G. Proof of Theorem 6

First, we prove that at the $t$-th iteration, the $t$-th entry of $g_t$, denoted by $g_t(t)$, has the same
sign as $\langle r_{t-1}, [A_{sel}]_{(:,t)} \rangle$ (we use the notation $\langle \cdot, \cdot \rangle$ for inner product in order to make it clearly
identifiable as a scalar), and therefore selecting a negative inner product would never produce
a $g_t \ge 0$, based on the orthogonality property that $r_{t-1}$ is perpendicular to all the columns of
$[A_{sel}]_{(:,1:t-1)}$. Suppose $[A_{sel}]_{(:,t)}$ is the selected column of $A$. Based on

$$h = r_{t-1} + [A_{sel}]_{(:,1:t-1)} g_{t-1} = r_t + [A_{sel}]_{(:,1:t)} g_t \qquad (25)$$

we have

$$[A_{sel}]_{(:,1:t)} g_t = [A_{sel}]_{(:,1:t-1)} g_{t-1} + r_{t-1} - r_t \qquad (26)$$

By taking the inner product of (26) with $r_{t-1}$, we have

$$\langle r_{t-1}, [A_{sel}]_{(:,t)} \rangle g_t(t) = \langle r_{t-1}, r_{t-1} - r_t \rangle = \|r_{t-1}\|_2 \left( \|r_{t-1}\|_2 - \frac{\langle r_{t-1}, r_t \rangle}{\|r_{t-1}\|_2} \right) \qquad (27)$$

Clearly, $\langle r_{t-1}, r_t \rangle / \|r_{t-1}\|_2$ is the scalar projection of $r_t$ onto $r_{t-1}$ and thus $\langle r_{t-1}, r_t \rangle / \|r_{t-1}\|_2 \le
\|r_t\|_2$. In addition, $\|r_t\|_2 < \|r_{t-1}\|_2$ due to the involvement of an additional column in the least-square
problem. Consequently, (27) is positive, which implies that $g_t(t)$ has the same sign as
$\langle r_{t-1}, [A_{sel}]_{(:,t)} \rangle$, and $\langle r_{t-1}, [A_{sel}]_{(:,t)} \rangle > 0$ is necessary for $g_t \ge 0$.

Second, we prove that it is always possible to select a column $[A_{sel}]_{(:,t)}$ from $A$, such that
$\langle r_{t-1}, [A_{sel}]_{(:,t)} \rangle > 0$, before the iterations terminate (i.e. $r_{t-1} \ne 0_{p \times 1}$). Define three sets of $p \times 1$
vectors $S_1 = \{v \,|\, \langle r_t, v \rangle > 0\}$, $S_2 = \{v \,|\, \langle r_t, v \rangle < 0\}$ and $S_3 = \{v \,|\, \langle r_t, v \rangle = 0\}$. It is clear that
$S_1$, $S_2$ and $S_3$ are mutually exclusive and are all convex. It is also clear that all the columns of
$[A_{sel}]_{(:,1:t)}$ are in $S_3$, and $h \in S_1$ based on (25). If there is no column of $A$ which is in $S_1$, then $h$
cannot be in the convex hull of the columns of $A$, hence contradicting the fact that $Ag = h$ with
some probability vector $g$. Therefore a positive inner product together with its corresponding
column of $A$ is always available for selection, and the proof is complete.
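
The sign property established in the first part of this proof can be observed numerically. The sketch below uses randomly generated data with a fixed seed; the matrix sizes, the choice of column 0 as the already selected column, and the helper name `sign_agreement` are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A_demo = rng.random((4, 5))                 # non-negative columns (illustrative data)
h_demo = A_demo @ rng.random(5)

# One column already selected: least-square weight and the resulting residue.
g_prev, *_ = np.linalg.lstsq(A_demo[:, [0]], h_demo, rcond=None)
r_prev = h_demo - A_demo[:, [0]] @ g_prev

def sign_agreement(j):
    """Check that the weight of a newly appended column j has the same sign
    as its inner product with the previous residue."""
    g, *_ = np.linalg.lstsq(A_demo[:, [0, j]], h_demo, rcond=None)
    return bool(np.sign(g[1]) == np.sign(r_prev @ A_demo[:, j]))

print([sign_agreement(j) for j in range(1, 5)])
```

This is why only columns with a positive inner product need to be tried: a column with a negative inner product would receive a negative weight.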

Appendix H. Proof of Theorem 7 

We first start with two preparatory lemmas which generalize Conjecture 1. 
Lemma 2: Let the $n \times m$ matrix $G$ have all of its entries being non-negative and all of its columns
being linearly independent, and $G x_1 = x_2$ with all entries of vector $x_1$ being non-negative, then
there exists at least one column $g_*$ of $G$ such that, with $G_*$ obtained by excluding $g_*$ from $G$,

$$x_3 = \arg\min_x \|x_2 - G_* x\|_2 = \arg\min_x \|G x_1 - G_* x\|_2 \ge 0 \qquad (28)$$

Proof: According to Conjecture 1, there exists at least one column $g_*$ of $G$ such that, with
$G_*$ obtained by excluding $g_*$ from $G$,

$$G_*^T (g_* - G_* x_4) = 0 \qquad (29)$$

holds with $x_4 \ge 0$. Let $y_1$ be the part of $x_1$ corresponding to $G_*$ and $y_g$ be the part of $x_1$
corresponding to $g_*$, i.e.

$$x_2 = G x_1 = G_* y_1 + y_g g_* \qquad (30)$$

The assertion in (28) is equivalent to the existence of $x_3 \ge 0$ such that $G_*^T G_* x_3 = G_*^T x_2 =
G_*^T G x_1$. In order to prove this, according to Farkas' lemma [20, Proposition 1.8], which states a
sufficient condition for such a non-negative vector to exist, it suffices to prove that for any vector
$x_5$ such that $(G_*^T G_*)^T x_5 = G_*^T G_* x_5 \ge 0$, $(G_*^T G x_1)^T x_5 = x_1^T G^T G_* x_5 \ge 0$ holds. Based on (29)
and (30), it is clear that

$$x_1^T G^T G_* x_5 = x_2^T G_* x_5 = (G_* y_1 + y_g g_*)^T G_* x_5 = y_1^T G_*^T G_* x_5 + y_g x_4^T G_*^T G_* x_5 \ge 0 \qquad (31)$$

given the known conditions that $G_*^T G_* x_5 \ge 0$, $x_4 \ge 0$, $y_1 \ge 0$ and $y_g \ge 0$, and we have proved
(28), which generalizes Conjecture 1. ■
Lemma 3: There exists a set of matrices $\{G_k\}_{k=1}^{m-1}$, in which $G_{m-1}$ is obtained by excluding
one column from $G$ and $G_k$ is obtained by excluding one column from $G_{k+1}$ for $k = 1, \ldots, m - 2$,
such that (28) holds with $G_*$ replaced by any matrix in $\{G_k\}_{k=1}^{m-1}$, i.e. $\arg\min_x \|x_2 - G_k x\|_2 \ge 0$
with $x_2 = G x_1$.

Proof: Equation (28) implies that, for $x_2 = G x_1$, there exists at least one column $g_*$ of
$G$ such that, with $G_{m-1}$ obtained by excluding $g_*$ from $G$, the orthogonal projection $G_{m-1} x_3$
of $x_2$ onto the column space of $G_{m-1}$ is inside the convex cone generated by the columns
of $G_{m-1}$. Furthermore, for any matrix $G_{m-2}$ whose columns form a subset of the columns of
$G_{m-1}$, it is easy to notice that the orthogonal projection of $G_{m-1} x_3$ onto the column space
of $G_{m-2}$ is identical to the orthogonal projection of $x_2$ onto the column space of $G_{m-2}$, i.e.
$\arg\min_x \|x_2 - G_{m-2} x\|_2 = \arg\min_x \|G_{m-1} x_3 - G_{m-2} x\|_2$. Considering that $x_3 \ge 0$ as proved
for (28) above, it follows that there exists a matrix $G_{m-2}$ obtained by excluding a column from
$G_{m-1}$, such that $\arg\min_x \|G_{m-1} x_3 - G_{m-2} x\|_2 \ge 0$, and thus $\arg\min_x \|x_2 - G_{m-2} x\|_2 \ge 0$.
This in turn implies that by excluding the columns of $G$ one by one, it is always possible to
guarantee that the orthogonal projection of $x_2$ onto the linear space formed by the remaining
columns is inside the convex cone generated by the remaining columns, at each step, i.e. there
exists $\{G_k\}_{k=1}^{m-1}$, in which $G_{m-1}$ is obtained by excluding one column from $G$ and $G_k$ is obtained
by excluding one column from $G_{k+1}$ for $k = 1, \ldots, m - 2$, such that $\arg\min_x \|x_2 - G_k x\|_2 \ge 0$
for $k = 1, \ldots, m - 1$. ■
Note that since $\arg\min_x \|x_2 - G_k x\|_2 \ge 0$ is not affected by permuting the columns of
$G_k$, Lemma 3 can be alternatively stated as follows: there exists at least one column-permuted
version of $G$, say $G_m$, such that $\arg\min_x \|G x_1 - [G_m]_{(:,1:k)} x\|_2 \ge 0$ holds for $k = 1, \ldots, m - 1$.

This will be applied to prove that Algorithm 2 can successfully find a sparse probability vector
involved in channel inclusion. Here we use the notations in the description of Algorithm 2,
and also new notations as needed. With the presence of inclusion and the actual sparsity level
being $s_1$, there exists at least one $p \times s_1$ matrix $A^*_{sel}$ with linearly independent columns, together
with an $s_1 \times 1$ vector $g^*_{s_1} > 0$, such that $A^*_{sel} g^*_{s_1} = h$, where $A^*_{sel}$ is defined as a $p \times s_1$ matrix
whose columns form a subset of the columns of $A$. Accordingly, there exists at least one
column-permuted version of $A^*_{sel}$, say $A_{s_1}$, such that

$$\arg\min_g \|h - [A_{s_1}]_{(:,1:k)} g\|_2 \ge 0 \qquad (32)$$

holds for $k = 1, \ldots, s_1 - 1$, and also for $k = s_1$ since $A_{s_1} g_{s_1} = h$ with $g_{s_1}$ being some
entry-permuted version of $g^*_{s_1}$. This fact will be used in the following to categorize
the possible behaviors of Algorithm 2 in terms of attempts made on the columns of $A$, from
beginning ($t_{act} = 0$, $t = 1$) to termination (when either a sparse solution is found giving $f = 1$,
$t = s_1 + 1$, or the algorithm declares no solution being found giving $f = 0$, $t = 0$), which will
lead to the conclusion that all possible behaviors of Algorithm 2 lead to $f = 1$.

Mathematically, the behavior of Algorithm 2 in terms of attempts made on the columns of $A$
from beginning to termination is defined as follows: it is an ordered set $B$ which has $s_1$ elements,
and the $k$-th element $B_k$ itself is a set whose elements are the columns of $A$ that were
attempted for the selection of $[A_{sel}]_{(:,k)}$, for $k = 1, \ldots, s_1$. Specifically, if some column $a$
of $A$ was attempted for the selection of $[A_{sel}]_{(:,k)}$, then $a \in B_k$, otherwise $a \notin B_k$.

Before making the proposed categorization, we establish some useful preliminaries. We refer
to $f = 1$ as success and $f = 0$ as failure when Algorithm 2 terminates. Note that $t$ is a
function of $t_{act}$ and will be denoted by $t(t_{act})$ as needed for clarification. The term "residue"
will refer to the residue associated with the $t$ columns $[A_{sel}]_{(:,1:t)}$ (which are already selected), i.e.
$h - [A_{sel}]_{(:,1:t)} \arg\min_g \|h - [A_{sel}]_{(:,1:t)} g\|_2$. We define the notion of order-$t$ generalized failure with
specified $[A_{sel}]_{(:,1:t)}$ (i.e. already selected $t$ columns satisfying $\arg\min_g \|h - [A_{sel}]_{(:,1:t)} g\|_2 \ge 0$),
as the situation in which all the (remaining) columns of $A$ having positive inner product with
the residue are attempted but not selected (or selected and removed later) as $[A_{sel}]_{(:,t+1)}$ and
backtracking has to be performed, i.e. $t(t_{act} + 1) = t(t_{act}) - 1$, as reflected by Lines 3 and
16 in Algorithm 2. Here we allow $t = 0$ and treat $[A_{sel}]_{(:,1:0)}$ as an empty ($p \times 0$) matrix
accordingly, hence order-0 generalized failure is equivalent to failure. Clearly, a generalized
failure does not necessarily lead to a failure, unless it is order-0, making it a necessary but not
sufficient condition of failure. Therefore, with specified $[A_{sel}]_{(:,1:t)}$, by ruling out the possibility
of order-$t$ generalized failure, it can be established that Algorithm 2 should result in success
with such $t$ columns specified. Also, notice that for some order-$t$ generalized failure to occur,
a necessary condition is that all possible choices for $[A_{sel}]_{(:,t+1)}$ (i.e. all remaining columns of
$A$ having positive inner product with the residue) are attempted, and consequently this becomes
a necessary condition for failure to occur. Furthermore, due to the backtracking feature, it is
possible for Algorithm 2 to attempt any possible choice for $[A_{sel}]_{(:,t+1)}$ (i.e. the columns of
$A$ having positive inner product with the residue). Based on Theorem 6, if some choice of
$[A_{sel}]_{(:,t+1)}$ appended to $[A_{sel}]_{(:,1:t)}$ results in a non-negative LS solution, then such a choice should
have positive inner product with the residue. These further imply that with specified $[A_{sel}]_{(:,1:t)}$,
if some column of $A$ appended to $[A_{sel}]_{(:,1:t)}$ results in a non-negative LS solution, i.e. with
$[A_{sel}]_{(:,t+1)}$ set to this column, $\arg\min_g \|h - [A_{sel}]_{(:,1:t+1)} g\|_2 \ge 0$, but has never been attempted
for the selection of $[A_{sel}]_{(:,t+1)}$ after the termination of Algorithm 2, then Algorithm 2 should
result in success with such specified $[A_{sel}]_{(:,1:t)}$. With this implication utilized below, the possible
behaviors of Algorithm 2 will be categorized into the ones that lead to success for sure and the
ones that may lead to generalized failures, in a recursive manner, and we finally rule out the
possibility of generalized failures.

Let $P_0$ denote the set of all possible $B$'s, i.e. all possible behaviors in terms of column
attempts of Algorithm 2. Define $P_k$ as the set of possible behaviors of Algorithm 2 with $[A_{s_1}]_{(:,1:k)}$
specified as $[A_{sel}]_{(:,1:k)}$, for $k = 0, \ldots, s_1$, which is in accordance with what $P_0$ represents, so that
induction will be enabled. Let $N_{k+1}$ denote the subset of $P_k$ in which $[A_{s_1}]_{(:,k+1)} \notin B_{k+1}$, i.e.
$[A_{s_1}]_{(:,k+1)}$ was never attempted for the selection of $[A_{sel}]_{(:,k+1)}$, for $k = 0, \ldots, s_1 - 1$. For the
base case, we consider the attempts made on selecting the first column of $A_{sel}$, which happen
at the instants with $t = 1$, regardless of what $t_{act}$ is, as reflected by $B_1$. We categorize $P_0$ into
$P_0 = P_1 \cup N_1$ with $P_1 \cap N_1 = \emptyset$, according to whether $[A_{s_1}]_{(:,1)} \in B_1$ or not: $P_1$ denotes the
subset of $P_0$ in which $[A_{s_1}]_{(:,1)} \in B_1$, i.e. $[A_{s_1}]_{(:,1)}$ was attempted at $t = 1$, and $N_1$ denotes the subset
of $P_0$ in which $[A_{s_1}]_{(:,1)} \notin B_1$, i.e. $[A_{s_1}]_{(:,1)}$ was never attempted at $t = 1$. Based on the above
mentioned implication, since (32) is satisfied with $k = 1$, $N_1$ gives rise to success, and $P_1$ gives
rise to the selection of $[A_{s_1}]_{(:,1)}$ as $[A_{sel}]_{(:,1)}$ (this can be justified based on Lines 9 and 12 in
Algorithm 2). At this stage, it remains open whether $P_1$ will lead to some generalized failure,
while such a possibility will eventually be ruled out as we perform further categorization on $P_1$.
For the inductive step, consider the attempts made on selecting the $(k + 1)$-th column of $A_{sel}$
(with $[A_{sel}]_{(:,1:k)}$ already specified), which happen at the instants with $t = k + 1$, regardless of
what $t_{act}$ is. It can be easily verified that the complementary set of $N_{k+1}$ in $P_k$ is $P_{k+1}$, since
based on (32), $[A_{s_1}]_{(:,k+1)}$ will be selected as $[A_{sel}]_{(:,k+1)}$ if it is attempted. Thus we now have
$P_k = P_{k+1} \cup N_{k+1}$ with $P_{k+1} \cap N_{k+1} = \emptyset$ for $k = 1, \ldots, s_1 - 1$, and eventually we have

$$P_0 = \bigcup_{k=1}^{s_1} N_k \cup P_{s_1} \qquad (33)$$

with the individual sets on the right hand side being mutually exclusive. Similar to the case of
$N_1$, each $N_k$ in (33) gives rise to success. It is also clear that $P_{s_1}$ gives rise to success, since
it has $A_{s_1}$ specified as $A_{sel}$, and $A_{s_1} g_{s_1} = h$ holds with $g_{s_1} > 0$. It then follows that $P_0$ gives
rise to success, i.e. Algorithm 2 is able to find a sparse probability vector successfully when
inclusion is present.